Abstract

Narendra-Shapiro (NS) algorithms are bandit-type algorithms developed in the 1960s. They have
been studied in depth over an infinite horizon, but few non-asymptotic results exist for them.
In this paper, we focus on a non-asymptotic study of the regret and address the following
question: are Narendra-Shapiro bandit algorithms competitive from this point of view? In
our main result, we obtain uniform explicit bounds for the regret of (over-)penalized NS
algorithms.

We also extend to the multi-armed case some convergence properties of penalized NS
algorithms towards a stationary Piecewise Deterministic Markov Process (PDMP). Finally, we
establish some new sharp mixing bounds for these processes.


Keywords: Regret, Stochastic Bandit Algorithms, Piecewise Deterministic Markov Processes 

1 Introduction 

The so-called Narendra-Shapiro bandit algorithm (referred to as NSa) was introduced in [TS] and
developed in [18] as a linear learning automaton. This algorithm has been primarily considered
by the probabilistic community as an interesting benchmark of stochastic algorithm. More
precisely, NSa is an example of a recursive (non-homogeneous) Markovian algorithm, a topic whose
almost complete historical overview may be found in the seminal contributions of [?] and [?].
NSa belongs to the large class of bandit-type policies whose principle may be sketched as follows:
a $d$-armed bandit algorithm is a procedure designed to determine which one, among $d$ sources, is
the most profitable without spending too much time on the wrong ones. In the simplest case, the
sources (or arms) randomly provide some rewards whose values belong to $\{0, 1\}$ with Bernoulli
laws. The associated probabilities of success $(p_1, \dots, p_d)$ are unknown to the player, whose
goal is to determine the most efficient source, i.e. the one with the highest probability of success.

Let us now recall a rigorous definition of admissible sequential policies. We consider $d$
independent sequences $(A^i_n)_{n \ge 0}$ of i.i.d. Bernoulli random variables. Each $A^i_n$ represents
the reward associated with arm $i$ at time $n$. We then consider sequential predictions
where, at each stage $n$, a forecaster chooses an arm $I_n$, receives a reward $A^{I_n}_n$, and then uses this
information to choose the next arm at step $n + 1$. As introduced in the pioneering work [20], the
rewards are sampled independently from a fixed product distribution at each step $n$. The innovations
at time $n$ are provided by $(I_n, A^{I_n}_n)$ and we are naturally led to introduce the filtration

$$(\mathcal{F}_n)_{n \ge 0} := \big(\sigma((I_1, A^{I_1}_1), \dots, (I_n, A^{I_n}_n))\big)_{n \ge 0}.$$

In the following, the sequential admissible policies will be $(\mathcal{F}_n)_{n \ge 0}$ (inhomogeneous)
Markov chains. We also define another filtration by adding all the events before step $n$:

$$(\bar{\mathcal{F}}_n)_{n \ge 0} := \big(\sigma((I_1, (A^j_1)_{1 \le j \le d}), \dots, (I_n, (A^j_n)_{1 \le j \le d}))\big)_{n \ge 0}.$$

To sum up, $\bar{\mathcal{F}}_n$ contains all the results of each arm between time 1 and $n$, whereas
$\mathcal{F}_n$ only provides partial information about the tested arms.

In this paper, we focus on the stochastic NSa, whose principle is very simple: it consists in sampling
one arm according to a probability distribution on $\{1, \dots, d\}$, and in modifying this probability
distribution according to the reward obtained with the chosen arm. From this point of view, this
algorithm bears similarities with the EXP3 algorithm (and many of its variants) introduced in [5].
Among other closely related bandit algorithms, one can also cite the Thompson Sampling strategy, where the
random selection of the arm is based on a Bayesian posterior which is updated after each result.
We refer to [1] for a recent theoretical contribution on this algorithm.

Instead of sampling one arm sequentially according to a randomized decision, other algorithms
define their policy through a deterministic maximization procedure at each iteration. Among
them, we can mention the UCB algorithm [4] and its derivatives (including MOSS [2] and
KL-UCB [5]), whose dynamics are dictated by an appropriate empirical upper confidence bound of the
estimated best performance.

Let us now present the NSa algorithm. In fact, we will distinguish two types of NSa: crude NSa
and penalized NSa. Before going further, let us recall their mechanism in the case of $d = 2$ (the
general case will be introduced in Section 2). Designating $X_n$ as the probability of drawing arm 1
at step $n$ and $(\gamma_n)_{n \ge 0}$ as a decreasing sequence of positive numbers that tends to 0 when $n$ goes
to infinity, crude NSa is recursively defined by:

$$X_{n+1} = X_n + \begin{cases} \gamma_{n+1}(1 - X_n) & \text{if arm 1 is selected and wins} \\ -\gamma_{n+1} X_n & \text{if the other arm is selected and wins} \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$

Note that the construction is symmetric, i.e., $1 - X_n$ (which corresponds to the probability
of drawing arm 2) has a symmetric dynamics. The long-time behavior of some NSa was extensively
investigated in the last decade. To name a few, in [?] and [?] some convergence and rate of
convergence results are proved. However, these results strongly depend on both $(\gamma_n)$ and the
probabilities of success of the arms. In order to get rid of these constraints, a penalized NSa
was then introduced in [16] and shown to be an efficient distribution-free
procedure, meaning that it converges to the best arm unconditionally on the unknown probabilities
$p_1$ and $p_2$. The idea of the penalized NS algorithm is to also take the failures of the player into
account and to reduce the probability of drawing the tested arm when it loses. Designating $(\rho_n)_{n \ge 0}$
as a second positive sequence, the dynamics of the penalized NSa is given by:

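$$X_{n+1} = X_n + \begin{cases} \gamma_{n+1}(1 - X_n) & \text{if arm 1 is selected and wins} \\ -\gamma_{n+1} X_n & \text{if arm 2 is selected and wins} \\ -\rho_{n+1}\gamma_{n+1} X_n & \text{if arm 1 is selected and loses} \\ \rho_{n+1}\gamma_{n+1}(1 - X_n) & \text{if arm 2 is selected and loses.} \end{cases} \tag{2}$$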
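
To make the four cases of the penalized update concrete, the following Python sketch simulates one trajectory of the probability of drawing arm 1. It is only an illustration under stated assumptions: the function names (`penalized_ns_step`, `run_penalized_ns`), the default constants and the step sizes $\gamma_n = \gamma_1/\sqrt{n}$, $\rho_n = \rho_1/\sqrt{n}$ are choices made here for the example.

```python
import math
import random


def penalized_ns_step(x, gamma, rho, p1, p2, rng):
    """One update of the two-armed penalized NSa; x is P(draw arm 1).

    Hypothetical helper: gamma and rho play the roles of the step size
    and of the penalty weight used for this update.
    """
    arm = 1 if rng.random() < x else 2                 # sample the arm
    win = rng.random() < (p1 if arm == 1 else p2)      # Bernoulli reward
    if arm == 1:
        # arm 1 wins: move towards 1; arm 1 loses: shrink x by rho*gamma*x
        x += gamma * (1.0 - x) if win else -rho * gamma * x
    else:
        # arm 2 wins: shrink x; arm 2 loses: give mass back to arm 1
        x += -gamma * x if win else rho * gamma * (1.0 - x)
    return x, arm, win


def run_penalized_ns(n_steps, p1, p2, gamma1=0.5, rho1=0.5, x0=0.5, seed=0):
    """Simulate X_0, ..., X_n with assumed step sizes gamma1/sqrt(n), rho1/sqrt(n)."""
    rng = random.Random(seed)
    x = x0
    path = [x]
    for n in range(1, n_steps + 1):
        gamma = gamma1 / math.sqrt(n)
        rho = rho1 / math.sqrt(n)
        x, _, _ = penalized_ns_step(x, gamma, rho, p1, p2, rng)
        path.append(x)
    return path
```

With step sizes in $(0, 1)$ every branch is a convex-type move, so the simulated $X_n$ stays in $[0, 1]$, which is what allows it to be read as a probability at each step.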

Performances of bandit algorithms. In view of potential applications, it is important
to have some information about the performance of the policies used. To this end, one first needs
to define what a “good” sequential algorithm is. The primary efficiency requirement is the ability
of the algorithm to asymptotically recover the best arm. In [?], this property is referred to as
the infallibility of the algorithm. If, without loss of generality, the first arm is assumed to be the
best (i.e. $p_1 > \max\{p_2, \dots, p_d\}$) and if $X^1_n$ denotes the probability of drawing arm 1, the
algorithm is said to be infallible if

$$\mathbb{P}\Big(X^1_n \xrightarrow[n \to +\infty]{} 1\Big) = 1. \tag{3}$$

An alternative way of describing the efficiency of a method is to consider the behaviour of the
cumulative reward $S_n$ obtained between time 1 and $n$:

$$S_n := \sum_{k=1}^{n} A^{I_k}_k.$$

In particular, in the old paper [20], Robbins looks for algorithms such that

$$\frac{\mathbb{E}[S_n]}{n} \xrightarrow[n \to +\infty]{} p_1.$$

This last property is weaker than the infallibility of an algorithm, since the Lebesgue theorem
applied to (3) implies the convergence above.


A much stronger requirement involves the regret of the algorithm. The regret measures the gap
between the cumulative reward of the best player and the one induced by the policy. The regret
$R_n$ is the $\bar{\mathcal{F}}_n$-measurable random variable defined as:

$$R_n := \max_{1 \le j \le d} \sum_{k=1}^{n} \big[A^j_k - A^{I_k}_k\big]. \tag{4}$$

A good strategy corresponds to a selection procedure that minimizes the expected regret $\mathbb{E} R_n$,
optimal ones being referred to as minimax strategies.

The expected regret above cannot be easily handled and is generally replaced in statistical analysis
by the pseudo-regret defined as

$$\bar{R}_n := \max_{1 \le j \le d} \mathbb{E} \sum_{k=1}^{n} \big[A^j_k - A^{I_k}_k\big].$$

Since $p_1 \ge p_j$ for all $j \ne 1$, $\bar{R}_n$ can also be written as

$$\bar{R}_n = \sum_{k=1}^{n} \mathbb{E}\big(A^1_k\big) - \mathbb{E}\Big(\sum_{k=1}^{n} A^{I_k}_k\Big) = n p_1 - \mathbb{E}[S_n]. \tag{5}$$

A low pseudo-regret property then means that the quantity $n\,\big[p_1 - \mathbb{E}[S_n]/n\big]$
has to be small, in particular sub-linear in $n$. The quantities $R_n$ and $\bar{R}_n$ are closely related and
it is reasonable to study the pseudo-regret instead of the true regret, owing to the next proposition:

Proposition 1.1. (i) For any $(\mathcal{F}_n)_{n \ge 0}$-measurable strategy, we obtain after $n$ plays:

$$0 \le \mathbb{E} R_n - \bar{R}_n \le \sqrt{\frac{n \log d}{2}}.$$

(ii) Furthermore, for every integer $n$ and $d$ and for any (admissible) strategy,

$$\sup_{p_1 \ge p_2 \ge \dots \ge p_d} \mathbb{E}[R_n] \ge \frac{1}{20}\min\{\sqrt{nd},\, n\}.$$

We refer to Proposition 34 of [3] for a detailed proof of (i) and to Theorem 5.1 of [5] for (ii). As
mentioned in (ii), the bounds are distribution-free (uniform in $p$)$^1$. Since the MOSS method of [3]
satisfies $\bar{R}_n \le 25\sqrt{nd}$, (i) and (ii) show that a non-asymptotic distribution-free minimax rate is
on the order of $\sqrt{n}$.

In particular, a fallible algorithm (meaning that $\mathbb{P}(X_n \to 1) < 1$) necessarily generates a
linear regret and is not optimal. For example, in the case $d = 2$, the dependence of $\bar{R}_n$ in terms of
$(X_n)$ is as follows:

$$\bar{R}_n = p_1 n - \sum_{k=1}^{n} \big(p_1 \mathbb{E}[X_k] + p_2 \mathbb{E}[1 - X_k]\big) = (p_1 - p_2)\, \mathbb{E}\Big[\sum_{k=1}^{n} (1 - X_k)\Big] \gtrsim (p_1 - p_2)\, \mathbb{P}(X_\infty = 0) \times n. \tag{6}$$

$^1$ The rate orders are strongly different if a dependence in $p$ is allowed.
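
Formula (6) suggests a direct Monte Carlo estimate of the two-armed pseudo-regret from simulated trajectories of $(X_k)$. The sketch below is only illustrative: the helper names and the representation of trajectories as plain lists are assumptions made for the example.

```python
def pseudo_regret_from_paths(paths, p1, p2):
    """Monte Carlo estimate of (p1 - p2) * E[sum_{k=1}^n (1 - X_k)].

    `paths` is a list of trajectories [X_1, ..., X_n] (hypothetical input
    format); the expectation is replaced by an average over trajectories.
    """
    total = sum(sum(1.0 - x for x in path) for path in paths)
    return (p1 - p2) * total / len(paths)


def linear_regret_lower_bound(p1, p2, prob_trap, n):
    """Right-hand side of (6): (p1 - p2) * P(X_inf = 0) * n."""
    return (p1 - p2) * prob_trap * n
```

For instance, if half of the trajectories are stuck at 0 and the other half at 1, the estimate grows like $(p_1 - p_2)\, n / 2$, which is the linear regime described above.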














Objectives. In this paper, we therefore propose to focus on the regret and to answer the
question “Are NSa competitive from a regret viewpoint? If so, what are
the associated upper bounds?”


Because its infallibility conditions are too restrictive, it will be seen that the crude NSa cannot be
competitive from a regret point of view. As mentioned before, the penalized NSa is more robust
and is a priori more appropriate for this problem. More precisely, the penalty induces more balance
between exploration and exploitation, i.e. between playing the best arm (the best one in view of the
past actions) and exploring new options (playing the suboptimal arms). In this paper, we are going
to prove that, up to a slight reinforcement, it is possible to obtain some competitive bounds for
the regret of this procedure. The slightly modified penalized algorithm will be referred to as the
over-penalized algorithm below.

Outline. The paper is organized as follows: Section 2.1 provides some basic information about
the crude NSa. Then, in Section 2.2, after some background on the penalized NSa, we introduce a
new algorithm called over-penalized NSa.

Section 3 is devoted to the main results: in Theorem 3.2, we establish an upper bound of the
pseudo-regret $\bar{R}_n$ for the over-penalized algorithm in the two-armed case and also show a weaker
result for the penalized NSa.

In this section, we also extend to the multi-armed case some existing convergence and rate of
convergence results of the two-armed algorithm. In the “critical” case (see below for details), the
normalized algorithm converges in distribution toward a PDMP (Piecewise Deterministic Markov
Process). We develop a careful study of its ergodicity and establish bounds on the rate of convergence
to equilibrium. The study uses a non-trivial coupling strategy to derive explicit rates of
convergence in Wasserstein and total variation distance. The dependence of these rates on
the parameters of the initial bandit problem is made explicit.

The rest of the paper is devoted to the proofs of the main results: Section 4 is dedicated to
the regret analysis, and Section 5 establishes the weak limit of the rescaled multi-armed bandit
algorithm. Finally, Section 6 includes all the proofs of the ergodic rates.


2 Definitions of the NS algorithms 

2.1 Crude NSa and regret 

The crude NSa (1) is rather simple: it defines an $(\mathcal{F}_n)_{n \ge 0}$ Markov chain $(X_n)_{n \ge 0}$, and $I_{n+1}$ is a random
variable satisfying:

$$\mathbb{P}(I_{n+1} = 1 \mid \mathcal{F}_n) = X_n \quad \text{and} \quad \mathbb{P}(I_{n+1} = 2 \mid \mathcal{F}_n) = 1 - X_n.$$

The arm $I_{n+1}$ is selected at step $n + 1$ with the current distribution $(X_n, 1 - X_n)$ and is evaluated.
In the event of success, the weight of the arm $I_{n+1}$ is increased and the weight of the other arm is
decreased by the same quantity. The algorithm can be rewritten in a more concise form as:

$$X_{n+1} = X_n + \gamma_{n+1} \big(\mathbf{1}_{I_{n+1} = 1} - X_n\big) A^{I_{n+1}}_{n+1}. \tag{7}$$

The arm $i$ at step $n$ succeeds with probability $p_i = \mathbb{P}(A^i_n = 1)$ and we suppose w.l.o.g. that
$p_1 > p_2$, so that arm 1 is the optimal one.

As pointed out in (6), we obtain that

$$\bar{R}_n = (p_1 - p_2)\, \mathbb{E}\Big[\sum_{k=1}^{n} (1 - X_k)\Big].$$

This formula is important regarding the fallibility of an algorithm. In particular, it is shown in
[?] that for any choice $\gamma_n = C(n + 1)^{-\alpha}$ with $\alpha \in (0, 1)$ and $C > 0$, or $\gamma_n = C/(n + 1)$ with $C > 1$,
the NSa (7) may be fallible: some parameters $(p_1, p_2)$ exist such that $(X_n)_{n \ge 0}$ a.s. converges to a
binary random variable $X_\infty$ with $\mathbb{P}(X_\infty = 0) > 0$. In this situation, for large enough $n$, we have:

$$\bar{R}_n \gtrsim (p_1 - p_2)\, \mathbb{P}(X_\infty = 0) \times n \gg \sqrt{n}.$$

It can easily be concluded that this method cannot induce a competitive policy since some “bad”
values of the probabilities $(p_1, p_2)$ generate a linear regret.

2.2 Penalized and over-penalized two-armed NSa 

Penalized NSa. A major difference between the crude NSa and its penalized counterpart
introduced in [16] lies in the exploitation of the failures of the selected arms. The crude NSa (1)
only uses the sequence of successes to update the probability distribution $(X_n, 1 - X_n)$, since the
value of $X_n$ is modified iff $A^{I_{n+1}}_{n+1} = 1$. In contrast, the penalized NSa also uses the information
generated by a potential failure of the arm $I_{n+1}$. More precisely, in the event of success of the
selected arm $I_{n+1}$, this penalized NSa mimics the crude NSa, whereas in the case of failure, the
weight of the selected arm is now multiplied (and thus decreased) by a factor $(1 - \gamma_{n+1}\rho_{n+1})$
(whereas the probability of drawing the other arm is increased by the corresponding quantity).
For the penalized NSa, the update formula of $(X_n)_{n \ge 1}$ can be written in the following way:

$$X_{n+1} = X_n + \gamma_{n+1} \big[\mathbf{1}_{I_{n+1} = 1} - X_n\big] A^{I_{n+1}}_{n+1} - \gamma_{n+1}\rho_{n+1} \big[X_n \mathbf{1}_{I_{n+1} = 1} - (1 - X_n) \mathbf{1}_{I_{n+1} = 2}\big] \big(1 - A^{I_{n+1}}_{n+1}\big). \tag{8}$$

Over-penalized NSa. In view of the minimization of the regret, we will show that it may be
useful to reinforce the penalization. For this purpose, we introduce a slightly “over-penalized” NSa
where a player is also (slightly) penalized if it wins:

• If player 1 wins, then with probability $1 - \sigma$ it is penalized by a factor $\gamma_{n+1}\rho_{n+1} X_n$.

• If player 2 wins, then with probability $1 - \sigma$ arm 1 is increased by a factor of $\gamma_{n+1}\rho_{n+1}(1 - X_n)$.

The over-penalized NSa can be written as follows:

$$X^\sigma_{n+1} = X^\sigma_n + \gamma_{n+1} \big[\mathbf{1}_{I_{n+1} = 1} - X^\sigma_n\big] A^{I_{n+1}}_{n+1} - \gamma_{n+1}\rho_{n+1} \big[X^\sigma_n \mathbf{1}_{I_{n+1} = 1} - (1 - X^\sigma_n) \mathbf{1}_{I_{n+1} = 2}\big] \big(1 - A^{I_{n+1}}_{n+1} B^\sigma_{n+1}\big) \tag{9}$$

where $(B^\sigma_n)_n$ is a sequence of i.i.d. r.v. with a Bernoulli distribution $\mathcal{B}(\sigma)$, meaning that
$\mathbb{P}(B^\sigma_n = 0) = 1 - \sigma$. Moreover, these r.v. are independent of $(A^j_n)_{n,j}$ and such that, for all $n \in \mathbb{N}$,
$B^\sigma_n$ and $I_n$ are also independent. It should be noted that

$$1 - A^{I_n}_n B^\sigma_n = \big[1 - A^{I_n}_n\big] + A^{I_n}_n \big(1 - B^\sigma_n\big).$$

In fact, this slight over-penalization of the successful arm (with probability $1 - \sigma$) can be viewed as
an additional statistical excitation which helps the stochastic algorithm to escape from local traps.
The case $\sigma = 1$ corresponds to the penalized NSa (8), whereas when $\sigma = 0$, the arm is always
penalized when it plays. In particular, this modification implies that the increment of $X^\sigma$ is slightly
weaker than in the previous case when the selected arm wins.
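
As an illustration of the over-penalized update (9), here is a minimal Python sketch of a single step. The function name `over_penalized_ns_step` and the way the Bernoulli variables are drawn are assumptions made for the example; the formula inside follows (9) term by term.

```python
import random


def over_penalized_ns_step(x, gamma, rho, sigma, p1, p2, rng):
    """One step of the two-armed sigma-over-penalized NSa, following (9).

    x is the current probability of drawing arm 1; the name and signature
    are hypothetical, chosen only for this sketch.
    """
    arm = 1 if rng.random() < x else 2
    a = 1 if rng.random() < (p1 if arm == 1 else p2) else 0  # reward A
    b = 1 if rng.random() < sigma else 0                     # B ~ Bernoulli(sigma)
    ind1 = 1.0 if arm == 1 else 0.0
    ind2 = 1.0 - ind1
    reward_part = gamma * (ind1 - x) * a
    # The penalty is active on a failure (a = 0) or on a success with b = 0.
    penalty_part = gamma * rho * (x * ind1 - (1.0 - x) * ind2) * (1 - a * b)
    return x + reward_part - penalty_part
```

Setting `sigma=1.0` makes `1 - a * b` equal to `1 - a`, so the step reduces to the penalized update (8), while `sigma=0.0` penalizes the played arm at every step.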


Asymptotic convergence of the penalized NSa. Before stating the main results, we need
to understand which regret $\bar{R}_n$ could be reached by penalized and over-penalized NSa. We recall
(in a slightly less general form) the convergence results of Proposition 3, Theorems 3 and 4 of [16].

Theorem 2.1 (Lamberton & Pages, [16]). Let $0 < p_2 < p_1 < 1$ and $\gamma_n = \gamma_1 n^{-\alpha}$ and $\rho_n = \rho_1 n^{-\beta}$
with $(\alpha, \beta) \in (0, +\infty)^2$ and $(\gamma_1, \rho_1) \in (0, 1)^2$. Let $(X_n)_n$ be the algorithm given by (8).

i) If $0 < \beta \le \alpha$ and $\alpha + \beta \le 1$, the penalized two-armed bandit is infallible.

ii) Furthermore, if $0 < \beta < \alpha$ and $\alpha + \beta < 1$, then
$$\frac{1 - X_n}{\rho_n} \longrightarrow \frac{1 - p_1}{p_1 - p_2} \quad \text{a.s.}$$

iii) If $\alpha = \beta \le 1/2$ and $g = \gamma_1/\rho_1$:
$$\frac{1 - X_n}{\rho_n} \xrightarrow{\ \mathcal{L}\ } \mu,$$
where $\xrightarrow{\mathcal{L}}$ stands for the convergence in distribution and $\mu$ is the stationary distribution of the
PDMP whose generator $\mathcal{L}$ acts on $\mathcal{C}^1_c(\mathbb{R}_+)$ as

$$\forall f \in \mathcal{C}^1_c(\mathbb{R}_+), \qquad \mathcal{L} f(y) = p_2\, y\, \frac{f(y + g) - f(y)}{g} + (1 - p_1 - p_1 y) f'(y).$$


In view of Theorem 2.1, we can use formula (6) to obtain

$$\bar{R}_n = (p_1 - p_2) \sum_{k=1}^{n} \rho_k\, \mathbb{E}\Big[\frac{1 - X_k}{\rho_k}\Big]. \tag{10}$$

We then obtain the key observation

$$\sup_{n \ge 1} \mathbb{E}\Big[\frac{1 - X_n}{\rho_n}\Big] \le C < +\infty \implies \bar{R}_n \le C (p_1 - p_2) \sum_{k=1}^{n} \rho_k, \tag{11}$$

where $C$ is a constant that may depend on $p_1$ and $p_2$. According to Theorem 2.1, it seems that the
potential optimal choice corresponds to that of (iii). Indeed, infallibility occurs only when
$\alpha \ge \beta$ and $\alpha + \beta \le 1$, and Equation (10) suggests that $\beta$ should be chosen as large as possible to
minimize the r.h.s. of (11), leading to $\alpha = \beta = 1/2$. This is why in the following, we will focus on
the case:

$$\gamma_n = \frac{\gamma_1}{\sqrt{n}} \quad \text{and} \quad \rho_n = \frac{\rho_1}{\sqrt{n}}. \tag{12}$$
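
To see why the choice (12) is compatible with a regret of order $\sqrt{n}$, note that the right-hand side of (11) involves $\sum_{k \le n} \rho_k = \rho_1 \sum_{k \le n} k^{-1/2}$, and an integral comparison gives $2(\sqrt{n+1} - 1) \le \sum_{k \le n} k^{-1/2} \le 2\sqrt{n}$. The short Python check below illustrates this; the helper name `penalty_sum` is introduced only for the example.

```python
import math


def penalty_sum(n, rho1=1.0):
    """Return sum_{k=1}^n rho1 / sqrt(k), the sum appearing in (11) under (12)."""
    return sum(rho1 / math.sqrt(k) for k in range(1, n + 1))


if __name__ == "__main__":
    for n in (10, 1000, 100000):
        # The ratio to sqrt(n) stays below 2, as the integral comparison predicts.
        print(n, penalty_sum(n) / math.sqrt(n))
```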


2.3 Over-penalized multi-armed NSa 

We generalize the definition of the penalized and over-penalized NSa to the $d$-armed case, with
$d \ge 2$. Let $p = (p_1, \dots, p_d) \in (0, 1)^d$ and assume that $A^j_n \sim \mathcal{B}(p_j)$ ($p_i$ is the probability of success
of arm $i$). The over-penalized NSa recursively defines a sequence of probability measures on
$\{1, \dots, d\}$ denoted by $(\Pi_n)_{n \ge 1}$ where $\Pi_n = (X^1_n, \dots, X^d_n)$. At step $n$, the arm $I_{n+1}$ is sampled
according to the discrete distribution $\Pi_n$ and is then tested through the computation of $A^{I_{n+1}}_{n+1}$.
Setting $j \in \{1, \dots, d\}$, the multi-armed NSa is defined by:

$$X^j_{n+1} = X^j_n + \gamma_{n+1} \big[\mathbf{1}_{I_{n+1} = j} - X^j_n\big] A^{I_{n+1}}_{n+1} - \gamma_{n+1}\rho_{n+1}\, X^{I_{n+1}}_n \Big[\mathbf{1}_{I_{n+1} = j} - \frac{1 - \mathbf{1}_{I_{n+1} = j}}{d - 1}\Big] \big(1 - A^{I_{n+1}}_{n+1} B^\sigma_{n+1}\big). \tag{13}$$

In contrast with the two-armed case, we have to choose how to distribute the penalty to the other
arms when $d > 2$. The (natural) choice in (13) is to divide it fairly, i.e., to spread it uniformly
over the other arms. Note that alternative algorithms (not studied here) could be considered.
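
The multi-armed recursion can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions: the function name `multi_armed_ns_step`, the 0-based arm indices and the list representation of $\Pi_n$ are choices made here, while the update itself spreads the penalty of the played arm uniformly over the $d - 1$ other arms as described above.

```python
import random


def multi_armed_ns_step(probs, gamma, rho, sigma, p, rng):
    """One step of the d-armed over-penalized NSa (arms indexed from 0).

    probs is the current distribution over arms and p the success
    probabilities; both names are hypothetical.
    """
    d = len(probs)
    arm = rng.choices(range(d), weights=probs)[0]   # sample I_{n+1}
    a = 1 if rng.random() < p[arm] else 0           # reward of the played arm
    b = 1 if rng.random() < sigma else 0            # B ~ Bernoulli(sigma)
    # Mass removed from the played arm when the penalty is active.
    penalty = gamma * rho * probs[arm] * (1 - a * b)
    new_probs = []
    for j in range(d):
        ind = 1.0 if j == arm else 0.0
        reward_term = gamma * (ind - probs[j]) * a
        # The played arm loses `penalty`; each other arm gains penalty/(d-1).
        penalty_term = penalty * (ind - (1.0 - ind) / (d - 1))
        new_probs.append(probs[j] + reward_term - penalty_term)
    return new_probs
```

Both the reward term and the penalty term sum to zero over $j$, so the updated vector remains a probability distribution.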


3 Main Results 


3.1 Regret of the over-penalized two-armed bandit 

First, we provide some uniform upper bounds for the two-armed $\sigma$-over-penalized NSa. Our main
result is Theorem 3.2. Before stating it, we choose to state a new result when $\sigma = 1$, i.e. for the
“original” penalized NSa introduced in [16].

Theorem 3.1. Let $(X_n)_{n \ge 0}$ be the two-armed penalized NSa defined by (8) with $(\gamma_n, \rho_n)_{n \ge 1}$ defined
by (12) with $(\gamma_1, \rho_1) \in (0, 1)^2$. Then, for every $\delta \in (0, 1)$, a positive $C_\delta$ exists such that:

$$\forall n \in \mathbb{N}, \qquad \sup_{(p_1, p_2) \in [0,1]^2,\ p_2 \le p_1 \wedge (1 - \delta)} \bar{R}_n \le C_\delta \sqrt{n}.$$

Remark 3.1. The upper bound of the original penalized NS algorithm is not completely uniform.
From a theoretical point of view, there is not enough penalty when $p_2$ is too large, which in turn
generates a deficiency of the mean-reverting effect for the sequence $((1 - X_n)/\rho_n)_{n \ge 1}$ when $X_n$
is close to 0. In other words, the trap of the stochastic algorithm near 0 is not repulsive enough,
and Figure 1 below shows that this problem also appears numerically and suggests a logarithmic
explosion of $\sup_{p_2 \le p_1} \bar{R}_n/\sqrt{n}$.

This explains the interest of the over-penalization, illustrated by the next result, which is the main
theorem of the paper.


Theorem 3.2. Let $(X_n)_{n \ge 0}$ be the two-armed $\sigma$-over-penalized NSa defined by (9) with $\sigma \in [0, 1)$
and $(\gamma_n, \rho_n)_{n \ge 1}$ defined by (12) with $(\gamma_1, \rho_1) \in (0, 1)^2$. Then,

(a) A $C_\sigma(\gamma_1, \rho_1)$ exists such that:

$$\forall n \in \mathbb{N}, \qquad \sup_{(p_1, p_2) \in [0,1]^2,\ p_2 \le p_1} \bar{R}_n \le C_\sigma(\gamma_1, \rho_1) \sqrt{n}.$$

(b) Furthermore, the choice $\sigma = 0$, $\gamma_n = 2.63\,\rho_n = 0.89/\sqrt{n}$ yields

$$\forall n \in \mathbb{N}, \qquad \sup_{(p_1, p_2) \in [0,1]^2,\ p_2 \le p_1} \bar{R}_n \le 31.1 \sqrt{2n}. \tag{14}$$

Remark 3.2. At the price of technicalities, $C_\sigma$ could be made explicit in terms of $\gamma_1$ and $\rho_1$ for
every $\sigma > 0$. The second bound is obtained by an optimization of $C_0(\gamma_1, \rho_1)$ (see (38) and below).




Figure 1: Evolution of $n \mapsto \sup_{(p_1, p_2) \in [0,1]^2,\ p_2 \le p_1} \bar{R}_n/\sqrt{n}$ for the over-penalized algorithm (with $\sigma = 0$)
and comparison with EXP3 and KL-UCB.

Figure 1 presents on the left side a numerical approximation of $n \mapsto \sup_{p_2 \le p_1} \bar{R}_n/\sqrt{n}$ for the penalized
and over-penalized algorithms. The continuous curves indicate that the upper bound $31.1\sqrt{2}$
in Theorem 3.2 is not sharp since the over-penalized NSa satisfies a uniform upper bound on
the order of $0.9\sqrt{n}$. This bound is obtained with a small $\sigma$ (as pointed out in Theorem 3.2) and
with $\gamma_n = 4\rho_n$ (red line in Figure 1 (left)), suggesting that the rewards should always be
over-penalized with $\rho_n = \gamma_n/4$.

The right-hand side of Figure 1 focuses on the behavior of the regret with $\sigma$. The map
$(n, \sigma) \mapsto \sup_{p_2 \le p_1} \bar{R}_n/\sqrt{n}$ confirms the influence of the over-penalization and indicates that, to obtain optimal
performances for the cumulative regret, we should use a low value of $\sigma$ between 0 and 3/5. The
importance of this choice of $\sigma$ seems relative since the behaviour of the over-penalized bandit is
stable on this interval. The best numerical choice is attained for $\sigma = 1/4$, together with a suitable
$\rho_n$, and makes it possible to achieve a long-time behavior of $\bar{R}_n/\sqrt{n}$ of the order 3/4 (see Figure 2, red line).

Figure 2: Evolution of $n \mapsto \sup_{(p_1, p_2) \in [0,1]^2,\ p_2 \le p_1} \bar{R}_n/\sqrt{n}$ for the over-penalized algorithm (with $\sigma = 1/4$)
and comparison with EXP3 and KL-UCB.
Finally, the statistical performances of the over-penalized NSa are compared with some classical
bandit algorithms: the KL-UCB algorithm (see e.g. [9] and the references therein) and EXP3 (see
[5]). These two algorithms are anytime policies that are known to be minimax optimal with a
cumulative minimax regret of the order $\sqrt{n}$. Figure 2 shows that the performances of the
over-penalized NSa are located between those of the KL-UCB algorithm and of the EXP3 algorithm
(our simulations suggest that the uniform bounds of KL-UCB and EXP3 are respectively 1/2 and
3/2). Also, it is worth noting that the simulation cost of the over-penalized NSa is much lower
than that of the initial UCB algorithm (the gap is even larger when compared to KL-UCB, which
involves an additional difficulty in the computation of the upper confidence bound at each step):
the same number of Monte Carlo simulations for the over-penalized NSa is almost a hundred times
faster than the KL-UCB runs in equivalent numerical conditions.


3.2 Convergence of the multi-armed over-penalized bandit 

We first extend Theorem 2.1 of [15] to the over-penalized NSa in the multi-armed situation. The
result describes the pointwise convergence.

Proposition 3.1 (Convergence of the multi-armed over-penalized bandit). Consider $p_d \le \dots \le
p_2 < p_1$ and $\gamma_n = \gamma_1 n^{-\alpha}$, $\rho_n = \rho_1 n^{-\beta}$ with $(\alpha, \beta) \in (0, +\infty)^2$ and $(\gamma_1, \rho_1) \in (0, 1)^2$. Algorithm (13)
with $\sigma \in (0, 1]$ satisfies

i) If $0 < \beta \le \alpha$ and $\alpha + \beta \le 1$, then $\lim_{n \to +\infty} \Pi_n = (1, 0, \dots, 0)$ a.s.

ii) Furthermore, if $0 < \beta < \alpha$ and $\alpha + \beta < 1$, then:

$$\forall i \in \{2, \dots, d\}, \qquad \frac{X^i_n}{\rho_n} \longrightarrow \frac{1 - \sigma p_1}{(d - 1)(p_1 - p_i)} \quad \text{a.s.}$$

Proposition 3.2 provides a description of the behavior of the normalized NSa while considering
$Y^i_n = X^i_n/\rho_n$. It states that $(Y_n)_{n \ge 0}$ converges to the dynamics of a Piecewise Deterministic
Markov Process (referred to as PDMP below).











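Proposition 3.2 (Weak convergence of the over-penalized NSa). Under the assumptions of
Proposition 3.1, if $\alpha = \beta \le 1/2$ and $g = \gamma_1/\rho_1$, then:

$$\frac{(X^2_n, \dots, X^d_n)}{\rho_n} \xrightarrow{\ \mathcal{L}\ } \mu_d,$$

where $\mu_d$ is the (unique) stationary distribution of the Markov process whose generator $\mathcal{L}_d$ acts on
compactly supported functions $f$ of $\mathcal{C}^1((\mathbb{R}_+)^{d-1})$ as follows:

$$\mathcal{L}_d f(y_2, \dots, y_d) = \sum_{i=2}^{d} \frac{p_i\, y_i}{g} \big(f(y_2, \dots, y_i + g, \dots, y_d) - f(y_2, \dots, y_i, \dots, y_d)\big) + \sum_{i=2}^{d} \Big(\frac{1 - \sigma p_1}{d - 1} - p_1 y_i\Big) \partial_i f(y_2, \dots, y_d). \tag{15}$$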

3.3 Ergodicity of the limiting process 

In this section, we focus on the long-time behavior of the limiting Markov process that appears
(after normalization) in Proposition 3.2. As mentioned before, this process is a PDMP and its
long-time behavior can be carefully studied with arguments in the spirit of [6]. We also learned
about the existence of a closely related study in the PhD thesis of Florian Bouguet (some details may be
found in [?]). Such properties are stated for both the one-dimensional and the multidimensional
cases.


3.3.1 One-dimensional case 


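Setting

$$a = 1 - p_1, \quad b = p_1, \quad g = \frac{\gamma_1}{\rho_1}, \quad c = \frac{p_2}{g},$$

the generator $\mathcal{L}$ given by Proposition 3.2 may be written as:

$$\forall f \in \mathcal{C}^1(\mathbb{R}_+, \mathbb{R}), \qquad \mathcal{L} f(x) = \underbrace{(a - bx)}_{\text{deterministic part}} f'(x) + \underbrace{c\,x}_{\text{jump rate}} \big(f(x + \underbrace{g}_{\text{jump size}}) - f(x)\big). \tag{16}$$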
In what follows, we will assume that $a$, $b$, $c$ and $g$ are positive numbers. The generator $\mathcal{L}$ has two parts.
On the one hand, the deterministic flow that guides the PDMP between the jumps is given by:

$$\begin{cases} \partial_t \phi(x, t) = (a - bx)\, \partial_x \phi(x, t) \\ \phi(x, 0) = x \in \mathbb{R}_+^* \end{cases}$$

so that

$$\phi(x, t) = \frac{a}{b} + \Big(x - \frac{a}{b}\Big) e^{-bt}.$$

Hence, if $x > a/b$ (resp. $x < a/b$), $t \mapsto \phi(x, t)$ decreases (resp. increases) and converges exponentially
fast to $a/b$.

On the other hand, the PDMP possesses some positive jumps that occur with a Poisson intensity
“$c \cdot x$”, whose size is deterministic and equal to $g$.

From the finiteness and positivity of $g$, it is easy to show that for every positive starting point, the
process is a.s. well-defined on $\mathbb{R}_+$, positive and does not explode in finite time. The fact that the
size of the jumps is deterministic is less important and what follows could easily be generalized to
a random size $g$ (under adapted integrability assumptions). In Figure 3 below, some paths of the
process are represented with different values of the parameters.
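
The description above (exponential relaxation toward $a/b$ between jumps, jumps of size $g$ at rate $c\,x$) translates into a short simulation. The Python sketch below uses a thinning (acceptance-rejection) scheme, which is a choice made here and not necessarily the exact simulation method behind Figure 3; the names `flow` and `simulate_pdmp` are likewise hypothetical. Thinning is valid because, starting from $x$, the flow is monotone toward $a/b$, so the jump rate stays below $c \max(x, a/b)$ until the next proposal.

```python
import math
import random


def flow(x, t, a, b):
    """Deterministic flow phi(x, t) = a/b + (x - a/b) * exp(-b t)."""
    return a / b + (x - a / b) * math.exp(-b * t)


def simulate_pdmp(x0, horizon, a, b, c, g, rng):
    """Return (jump_times, X_horizon) for the PDMP with generator (16)."""
    t, x = 0.0, x0
    jump_times = []
    while True:
        # Upper bound on the jump rate c * phi(x, s) for all s >= 0.
        rate_bound = c * max(x, a / b)
        wait = rng.expovariate(rate_bound)
        if t + wait >= horizon:
            return jump_times, flow(x, horizon - t, a, b)
        x = flow(x, wait, a, b)
        t += wait
        # Accept the proposed jump with probability (c * x) / rate_bound.
        if rng.random() * rate_bound < c * x:
            x += g
            jump_times.append(t)
```

A call such as `simulate_pdmp(1.0, 5.0, 0.2, 0.8, 0.2, 0.1, random.Random(0))` uses the parameters of the top-left panel of Figure 3 and returns the accepted jump times together with the state at time 5.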


3.3.2 Convergence results 

As pointed out in Figure 3, the long-time behavior of the process depends on the
relationship between the mean-reverting effect generated by the deterministic flow and the frequency and size of the
jumps.

Figure 3: Exact simulation of trajectories of a process driven by (16) when $g = 0.1$, $a = 0.2$, $b = 0.8$,
$c = 0.2$ (top left), $g = 2$, $a = 0.2$, $b = 0.8$, $c = 0.1$ (top right), $g = 2$, $a = 0.9$, $b = 0.9$, $c = 0.15$
(bottom left) and $g = 2$, $a = 0.8$, $b = 0.2$, $c = 0.05$ (bottom right).


Invariant measure. The process (16) possesses a unique invariant distribution if $b - cg > 0$.
Actually, the existence is ensured by the fact that $V(x) = x$ is a Lyapunov function for the process
since

$$\forall x \in \mathbb{R}_+^*, \qquad \mathcal{L} V(x) = a - (b - cg)x = a - (b - cg)V(x).$$

Among other arguments, the uniqueness is ensured by Theorem 3.3 (the convergence in Wasserstein
distance of the process toward the invariant distribution implies in particular its uniqueness). We
denote it by $\mu_\infty$ below. It could also be shown that $\mathrm{Supp}(\mu_\infty) = (a/b, +\infty)$, that the process is
strongly ergodic on $(a/b, +\infty)$ (see [12] for some background) and that if $b - cg < 0$, the process
explodes when $t \to +\infty$ (this case corresponds to the bottom left-hand side of Figure 3). Finally,
it should be noted that for the limiting PDMP of the bandit algorithm,

$$b - cg = p_1 - p_2 = \pi$$

and thus, the ergodicity condition coincides with the positivity of $\pi$.
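
For completeness, the Lyapunov identity used above can be checked in one line from the form (16) of the generator, and it also gives the evolution of the mean. The derivation below is a sketch: the second part assumes enough integrability to apply the generator to $V(x) = x$ along the process (Dynkin's formula), and $m(t)$ is a notation introduced here for $\mathbb{E}[X_t]$.

```latex
% Applying (16) to V(x) = x, with V'(x) = 1 and V(x+g) - V(x) = g:
\mathcal{L}V(x) = (a - bx)\cdot 1 + c\,x\,\big((x + g) - x\big)
               = a - bx + cg\,x
               = a - (b - cg)\,x .

% Consequently m(t) := E[X_t] solves a linear ODE (assuming integrability):
m'(t) = a - (b - cg)\,m(t)
\quad\Longrightarrow\quad
m(t) = \frac{a}{b - cg} + \Big(m(0) - \frac{a}{b - cg}\Big) e^{-(b - cg)t}.
```

When $b - cg > 0$ the mean therefore relaxes to $a/(b - cg)$ at the exponential rate $b - cg$, which equals $\pi$ for the limiting PDMP of the bandit algorithm.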


Wasserstein results. We aim to derive rates of convergence for the PDMP toward $\mu_\infty$ for two
distances, namely the Wasserstein distance and the total variation distance. Such results can be
obtained in rather different ways, using coupling arguments or PDEs. We use coupling techniques here,
in line with the work of [7] and [10]. Before stating our results, let us recall that the
$p$-Wasserstein distance between two probability measures $\mu$ and $\nu$ is defined by:

$$W_p(\mu, \nu) = \inf\Big\{ \big(\mathbb{E}\,|X - Y|^p\big)^{1/p} \;\Big|\; \mathcal{L}(X) = \mu,\ \mathcal{L}(Y) = \nu \Big\}.$$
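
In dimension one and for $p \ge 1$, the infimum in the definition of $W_p$ is attained by the monotone (quantile) coupling, so the distance between two empirical measures with the same number of atoms can be computed by sorting. The Python sketch below relies on this standard fact; the function name `empirical_wasserstein` is introduced here for illustration, for instance to compare simulated samples of the PDMP at two different times.

```python
def empirical_wasserstein(xs, ys, p=1):
    """W_p between the empirical measures of two equal-size real samples.

    Uses the monotone coupling (sorted samples matched in order), which is
    optimal on the real line for p >= 1.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(xs), sorted(ys))
    mean_cost = sum(abs(x - y) ** p for x, y in pairs) / len(xs)
    return mean_cost ** (1.0 / p)
```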


Designating $\mu_0$ as the initial distribution of the PDMP and $\mu_t$ as its law at time $t$, we now state
the main result on the PDMP in dimension one driven by (16).

Theorem 3.3 (One-dimensional PDMP). Let $p \ge 1$ and denote for every $t \ge 0$ $\mu_t := \mathcal{L}(X^{\mu_0}_t)$,
where $(X^{\mu_0}_t)$ is a Markov process driven by (16) with initial distribution $\mu_0$ (with support included
in $\mathbb{R}_+^*$). If $p = 1$, we have

$$\Big| \int x\, (\mu_0 - \mu_\infty)(dx) \Big|\, e^{-\pi t} \le W_1(\mu_t, \mu_\infty) \le W_1(\mu_0, \mu_\infty)\, e^{-\pi t}$$

and if $p > 1$, a constant $\gamma_p$ exists such that

$$W_p(\mu_t, \mu_\infty) \le \gamma_p\, e^{-\frac{\pi}{p} t}$$

where $(\gamma_p)_{p \ge 1}$ satisfies the recursion $\gamma_p^p = \gamma_{p-1}^{p-1}\,[pa + (1 + g)^p]$.

Remark 3.3. If $p = 1$, the lower and upper bounds imply the optimality of the rate obtained in
the exponential. For $p > 1$, the optimality of the exponent $e^{-\pi t/p}$ is still an open question.

We now give a corollary for the limiting process that appears in Proposition 3.2.

Corollary 3.1 (Multi-dimensional PDMP). Let $(Y_t)_{t \ge 0}$ be the PDMP driven by (15) with initial
distribution $\mu_0 \in \mathcal{P}\big((\mathbb{R}_+^*)^d\big)$. Then, the conclusions of Theorem 3.3 hold with $\pi = p_1 - p_2$.

The proof is almost obvious due to the “tensorized” form of the generator $\mathcal{L}_d$. Actually, for
every starting point $y = (y_2, \dots, y_d)$, all the coordinates $(Y^i_t)_{t \ge 0}$ are independent one-dimensional
PDMPs with generator $\mathcal{L}$ defined by (16) with

$$a_i = \frac{1 - \sigma p_1}{d - 1}, \quad b_i = p_1 \quad \text{and} \quad c_i = p_i/g. \tag{17}$$

The result then easily follows from Theorem 3.3 with a global rate given by $\min\{b_i - c_i g,\ i =
2, \dots, d\} = p_1 - p_2$. The details are left to the reader.


3.4 Total variation results 


When some bounds are available for the Wasserstein distance, a classical way to deduce an upper
bound of the total variation is to build a two-step coupling. In the first step, a Wasserstein coupling
is used to bring the paths sufficiently close (with a probability controlled by the Wasserstein bound).
In a second step, we use a total variation coupling to try to stick the paths together with high probability.
In our case, the jump size is deterministic and sticking the paths implies a non-trivial coupling of
the jump times. Some of the ideas used to obtain the results below are in the spirit of [7], which follows
this strategy for the TCP process.

Theorem 3.4. Let $\mu_0$ be a starting distribution with moments of any order. Then, for every $\varepsilon > 0$,
a $C_\varepsilon > 0$ exists such that:

$$\|\mu_0 P_t - \mu_\infty P_t\|_{TV} \le C_\varepsilon\, e^{-(\alpha\pi - \varepsilon)t},$$

where $\alpha$ is an explicit constant whose expression involves the product $ac$.


Once again, this result can be extended to the multi-armed case. 


Corollary 3.2. Let $(Y_t)_{t \ge 0}$ be the PDMP driven by (15) with initial distribution $\mu_0 \in \mathcal{P}\big((\mathbb{R}_+^*)^d\big)$. Then,
the conclusions of Theorem 3.4 hold with $\alpha\pi$ replaced by:


E 


1 


7Tj 


where $\pi_i = p_1 - p_i$ and $a_i$, $b_i$ and $c_i$ are defined by (17).

The proof of this result is based on the remark that follows Corollary 3.1. Owing to the
“tensorization” property, the probability of coupling all the coordinates before time $t$ is essentially
the product of the probabilities of coupling each coordinate. Once again, the details of this
corollary are left to the reader.

4 Proof of the regret bound (Theorems 3.1 and 3.2)


This section is devoted to the study of the regret of the penalized two-armed bandit procedure
described in Section 2. We will mainly focus on the proof of the explicit bound given in Theorem
3.2 b) and we will give the main ideas for the proofs of Theorems 3.1 and 3.2 a).


4.1 Notations 

In order to lighten the notation, $X^1_n$ will be denoted by $X_n$, so that $X^2_n = 1 - X_n$.

The proofs are then strongly based on a detailed study of the behavior of the (positive) sequence
$(Y_n)_{n \ge 1}$ defined by

$$\forall n \ge 1, \qquad Y_n = \frac{1 - X_n}{\gamma_n}. \tag{18}$$

As we said before, we will consider the following sequences $(\gamma_n)_{n \ge 1}$ and $(\rho_n)_{n \ge 1}$ below:

$$\forall n \ge 1, \qquad \gamma_n = \frac{\gamma_1}{\sqrt{n}} \quad \text{and} \quad \rho_n = \frac{\rho_1}{\sqrt{n}} = \frac{\rho_1}{\gamma_1}\, \gamma_n,$$

where $\gamma_1$ and $\rho_1$ are constants in $(0, 1)$ that will be specified later. In the meantime, we also define:

$$\pi = p_1 - p_2 \in (0, 1).$$

With this setting, the pseudo-regret is

$$\bar{R}_n = \pi \sum_{k=1}^{n} \gamma_k\, \mathbb{E}[Y_k].$$

It should be noted here that we have replaced the division by $\rho_n$ with a normalization by $\gamma_n$. This will be easier to handle in the sequel. The main issue now is to obtain a convenient upper bound for $\mathbb{E}[Y_n]$. More precisely, note that: 

\[ \forall n_0 \in \mathbb{N}, \ \forall n \le n_0 - 1, \quad R_n \le \pi n \le \pi \sqrt{n_0 - 1}\, \sqrt{n}, \]

and conversely, for every $n \ge n_0$, 

\[ \frac{R_n}{\sqrt{n}} \le \pi \sqrt{n_0 - 1} + \pi \sup_{k \ge n_0} \mathbb{E}[Y_k] \, \frac{1}{\sqrt{n}} \sum_{k=n_0}^{n} \frac{\gamma_1}{\sqrt{k}} \le \pi \Big( \sqrt{n_0 - 1} + 2\gamma_1 \sup_{k \ge n_0} \mathbb{E}[Y_k] \Big). \tag{19} \]

Thus it is enough to derive an upper bound of $\mathbb{E}[Y_n]$ after an iteration $n_0$ that can be on the order of $1/\pi^2$. In particular, the “suitable” choice of $n_0$ will strongly depend on the value of $\pi$. 

4.2 Evolution of $(Y_n)_{n \ge 1}$ 

Recursive dynamics of $(Y_n)_{n \ge 1}$. In order to understand the mechanism and difficulties of the penalized procedure, let us first roughly describe the behavior of the sequences $(X_n)_{n \ge 1}$ and $(Y_n)_{n \ge 1}$. The definition of the algorithm gives 

\[ \mathbb{E}[X_{n+1} \mid \mathcal{F}_n] = X_n + \gamma_{n+1} X_n (1 - X_n)[p_1 - p_2] + \gamma_{n+1}\rho_{n+1} \big[ (1 - X_n)^2 (1 - \sigma p_2) - X_n^2 (1 - \sigma p_1) \big]. \]

It can be observed that the drift term may be split into two parts, where the main part is the usual drift of NSa described by the function $h$ defined by: 

\[ \forall x \in [0,1], \quad h(x) = [p_1 - p_2]\, x (1 - x). \tag{20} \]
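The second term comes from the penalization procedure and depends on $\sigma$. We set 

\[ \kappa_\sigma(x) = (1 - \sigma p_2)(1 - x)^2 - (1 - \sigma p_1) x^2. \tag{21} \]

As a consequence, we can write the evolution of $(X_n)_{n \ge 0}$ as follows: 

\[ 1 - X_{n+1} = 1 - X_n - \gamma_{n+1} \big[ h(X_n) + \rho_{n+1} \kappa_\sigma(X_n) + \Delta M_{n+1} \big], \tag{22} \]

where $\Delta M_{n+1}$ is a martingale increment. Dividing (22) by $\gamma_{n+1}$ and using $h(X_n) = \pi \gamma_n X_n Y_n$, we easily derive that 

\[ \forall n \ge 1, \quad Y_{n+1} = Y_n \big( 1 + \gamma_n (\varepsilon_n - \pi X_n) \big) - \rho_{n+1} \kappa_\sigma(X_n) - \Delta M_{n+1}, \]

where 

\[ \varepsilon_n = \frac{1}{\gamma_{n+1}} - \frac{1}{\gamma_n} = \frac{1}{\gamma_1} \big( \sqrt{n+1} - \sqrt{n} \big) \le \frac{\gamma_n}{2\gamma_1^2}. \tag{23} \]

It follows that the increments of $(Y_n)_{n \ge 1}$ are given by: 

\[ \Delta Y_{n+1} := Y_{n+1} - Y_n = \gamma_n \varphi_n(Y_n) - \Delta M_{n+1}, \]

where the drift function $\varphi_n$ acting on the sequence $(Y_n)_{n \ge 1}$ is defined as 

\[ \varphi_n(y) = \underbrace{y \big[ \varepsilon_n + \pi(\gamma_n y - 1) \big]}_{:= \varphi_n^1(y)} \; - \; \underbrace{\frac{\rho_{n+1}}{\gamma_n} \kappa_\sigma(1 - \gamma_n y)}_{:= \varphi_n^2(y)}. \]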
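The split of the conditional drift into $h$ and $\kappa_\sigma$ can be checked by enumerating the six outcomes of one step. The sketch below is an illustration under an assumption: the one-step update is reconstructed from the conditional moments used in the proof of Lemma 4.1 (arm 1 is played with probability $X_n$; a success moves $X_n$ towards the played arm and, with probability $1 - \sigma$, is followed by a penalty; a failure triggers the penalty alone). It is not quoted from Section 2, and all function names are hypothetical.

```python
from fractions import Fraction as F

def outcomes(x, gamma, rho, p1, p2, sigma):
    """(probability, increment of X) for one step of the ASSUMED penalized update."""
    return [
        (x * p1 * sigma,              gamma * (1 - x)),               # arm 1 succeeds, no penalty
        (x * p1 * (1 - sigma),        gamma * (1 - x - rho * x)),     # arm 1 succeeds, penalized
        (x * (1 - p1),               -gamma * rho * x),               # arm 1 fails
        ((1 - x) * p2 * sigma,       -gamma * x),                     # arm 2 succeeds, no penalty
        ((1 - x) * p2 * (1 - sigma), -gamma * (x - rho * (1 - x))),   # arm 2 succeeds, penalized
        ((1 - x) * (1 - p2),          gamma * rho * (1 - x)),         # arm 2 fails
    ]

def h(x, p1, p2):
    """Main drift, Equation (20)."""
    return (p1 - p2) * x * (1 - x)

def kappa(x, p1, p2, sigma):
    """Penalty drift, Equation (21)."""
    return (1 - sigma * p2) * (1 - x) ** 2 - (1 - sigma * p1) * x ** 2

def mean_increment(x, gamma, rho, p1, p2, sigma):
    """E[X_{n+1} - X_n | F_n], computed by exact enumeration."""
    return sum(prob * dx for prob, dx in outcomes(x, gamma, rho, p1, p2, sigma))

def drift(x, gamma, rho, p1, p2, sigma):
    """gamma * (h + rho * kappa_sigma), the drift read off from Equation (22)."""
    return gamma * (h(x, p1, p2) + rho * kappa(x, p1, p2, sigma))
```

Exact rational arithmetic is used so that the two expressions can be compared with `==` rather than up to a tolerance.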






To better understand the underlying effects of the dynamical system, it should be recalled that the definition of the sequence $(Y_n)_{n \ge 1}$ implies that $Y_n \in [0, \gamma_n^{-1}]$ with $\gamma_n^{-1}$ of order $n^{1/2}$. Since we aim to obtain a uniform bound (over $n$) of $\mathbb{E}[Y_n]$, it is thus important to understand the behavior of the drift $\varphi_n$ over $[0, \gamma_n^{-1}]$. In particular, it is of primary interest to see where the function $\varphi_n$ is negative. 

Crude NSa. When dealing with the crude bandit algorithm (i.e., when $\rho_1 = 0$, see 0), the drift is reduced to $\varphi_n^1$. One can check that $\varphi_n^1(y)$ is negative iff 

\[ \varepsilon_n - \pi(1 - \gamma_n y) < 0 \iff y < \gamma_n^{-1} - \frac{\varepsilon_n}{\pi \gamma_n} \iff x > \frac{\varepsilon_n}{\pi}, \]

where $x = 1 - \gamma_n y$. This means that when $x$ is close to 0 (in some sense depending on $n$, $\pi$ and $\gamma_1$), $\varphi_n^1$ becomes positive and $Y_n$ has a tendency to increase. In other words, the dynamical system $(Y_n)_{n \ge 1}$ has no mean-reverting effect when $Y_n$ is far from 0. The fact that the crude bandit algorithm does not always converge to the good target can be understood as a consequence of this remark. 
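The sign condition on the crude drift can be illustrated numerically. The following sketch assumes the step size $\gamma_n = \gamma_1/\sqrt{n}$ used in this section; the helper names and the grid resolution are choices made here for illustration only.

```python
import math

def gamma(n, gamma1):
    """Step size gamma_n = gamma_1 / sqrt(n)."""
    return gamma1 / math.sqrt(n)

def eps(n, gamma1):
    """epsilon_n = 1/gamma_{n+1} - 1/gamma_n, see Equation (23)."""
    return (math.sqrt(n + 1) - math.sqrt(n)) / gamma1

def crude_drift(y, n, gamma1, pi):
    """phi_n^1(y) = y * (eps_n - pi * (1 - gamma_n * y)), the drift without penalty."""
    return y * (eps(n, gamma1) - pi * (1 - gamma(n, gamma1) * y))

def sign_matches_threshold(n, gamma1, pi, grid=200):
    """Check on a grid of (0, 1/gamma_n) that phi_n^1(y) < 0 exactly when x > eps_n / pi."""
    g, e = gamma(n, gamma1), eps(n, gamma1)
    for i in range(1, grid):
        y = (i / grid) / g
        x = 1 - g * y
        if abs(x - e / pi) < 1e-9:   # skip a grid point sitting on the threshold
            continue
        if (crude_drift(y, n, gamma1, pi) < 0) != (x > e / pi):
            return False
    return True
```

For instance, with $n = 100$, $\gamma_1 = 0.9$ and $\pi = 0.1$, the threshold $\varepsilon_n/\pi$ is about 0.55, so the drift pushes $Y_n$ upwards on a large part of the state space.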

Penalized and Over-Penalized NSa. When the drift $\varphi_n$ contains a non-zero penalty, the second term $-\varphi_n^2$ may help the dynamics to not be repulsive when $x$ is close to 0, i.e. when $y$ is close to $1/\gamma_n$. It can be checked that $\kappa_\sigma(0) = 1 - \sigma p_2$ and: 

lim p n ( 7 „ _1 ) = i - — (1 - crp 2 )- 

n —>+00 2 "ff pi 

This quantity is negative under the condition: 

1 ~ aP2> 2jf' (24) 

But, in order to obtain a uniform bound on the regret, this constraint must be satisfied indepen¬ 
dently of p 2 . When a = 1, i.e. in the standardly penalized case, one remarks that for any choice 
of pi and 71 , this is only possible if pi/( 2 y?) > 1 — p 2 . At this stage, one can thus understand the 
over-penalization as a way of controlling uniformly (in $p_2$) the negativity of $\varphi_n$ far from $y = 0$ (see Figure 4). 

In view of the main results, there are still two problems. The first one is that even in the over-penalized case, Inequality (24) implies some constraints on $\gamma_1$ and $\rho_1$, which do not appear in Theorem 3.2. The second one, which is more embarrassing for the study of $(\mathbb{E}[Y_n])_{n \ge 1}$, is that, near $y = 0$, $\varphi_n$ is positive since $\varphi_n(0) = \frac{\rho_{n+1}}{\gamma_n}(1 - \sigma p_1) > 0$ (see Figure 4). This repulsive behavior near $y = 0$ can be understood as the counterpart induced by the penalization. In order to bypass the two previous problems, the main argument will be the increase of exponent (see next section), where we show that we can replace the study of $(\mathbb{E}[Y_n])_{n \ge 1}$ by that of a sequence which both has a nicer behavior near $y = 0$ and alleviates the constraint (24). 




Figure 4: Drift decomposition (left) and global drift (right) when $y \in [0, \gamma_n^{-1}]$ with $\gamma_1 = \rho_1 = 1$, $p_1 = 0.7$, $p_2 = 0.6$, $\sigma = 0.5$. 


4.3 Increase of exponent 

We introduce the sequence $(Z_n^{(r)})_{n \ge 0}$ defined by: 

\[ \forall n \ge 1, \quad Z_n^{(r)} = \frac{(1 - X_n)^r}{\gamma_n}. \tag{25} \]

At this stage, one can first remark that a.s., for every $r \ge 1$, $Z_n^{(r)} \le Z_n^{(1)} = Y_n$. One can thus guess that the difficulties tackled at the end of the previous section will be easier to overcome for $(\mathbb{E}[Z_n^{(r)}])_{n \ge 1}$ with $r > 1$. Of course, this remark is of interest only if, conversely, one is able to relate the control of $\mathbb{E}[Y_n]$ to that of $\mathbb{E}[Z_n^{(r)}]$, $r > 1$. 

This is the purpose of Proposition 4.1 where, taking advantage of the structure of the algorithm, one shows that for every $r \ge 1$, $\mathbb{E}[Z_n^{(r)}]$ can be controlled by a function of $\mathbb{E}[Z_n^{(r+1)}]$. 
Let us define the bounded function $h_r$ on $[0,1]$ (Equation (26)): 


V 7 G [0,1] 




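Proposition 4.1. Let $r \in \mathbb{N}^*$, $\gamma_1 \in (0,1)$ and $0 < \varepsilon \le \varepsilon_0$, and set 

\[ n_0(\varepsilon, \pi, \gamma_1) = \left\lfloor \frac{1}{4 \varepsilon^2 \gamma_1^2 \pi^2} \right\rfloor + 1. \tag{27} \]

Then, if $2 \varepsilon \gamma_1^2 (r - \varepsilon) < 1$, 

\[ \sup_{n \ge n_0} \mathbb{E} Z_n^{(r)} \le \mathbb{E} Z_{n_0}^{(r)} + \frac{r}{\pi (r - \varepsilon)} \Big[ \rho_1 + h_r(\gamma_{n_0}) + \pi \sup_{n \ge n_0} \mathbb{E}[Z_n^{(r+1)}] \Big]. \]

In particular, for $r = 1, 2$, the previous inequality holds for every $\gamma_1 \in (0,1)$ and $\varepsilon \in (0, 1/3]$. 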
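The role of the threshold $n_0$ can be made concrete with a few lines of code. The sketch below assumes $\gamma_n = \gamma_1/\sqrt{n}$ and reads (27) with an integer part, which is an interpretation made here; it checks that $\varepsilon_n \le \varepsilon\pi$ once $n \ge n_0$, which is what turns the factor $1 + \gamma_n(\varepsilon_n - r\pi)$ into a contraction. Function names are illustrative.

```python
import math

def gamma(n, gamma1):
    """Step size gamma_n = gamma_1 / sqrt(n)."""
    return gamma1 / math.sqrt(n)

def eps_n(n, gamma1):
    """epsilon_n = 1/gamma_{n+1} - 1/gamma_n, see Equation (23)."""
    return 1 / gamma(n + 1, gamma1) - 1 / gamma(n, gamma1)

def n0(eps, pi, gamma1):
    """Threshold of Equation (27); the floor is an assumption of this sketch."""
    return math.floor(1 / (4 * eps ** 2 * gamma1 ** 2 * pi ** 2)) + 1

def contraction_holds(n, r, eps, pi, gamma1):
    """Check 1 + gamma_n (eps_n - r pi) <= 1 - pi (r - eps) gamma_n."""
    g = gamma(n, gamma1)
    return 1 + g * (eps_n(n, gamma1) - r * pi) <= 1 - pi * (r - eps) * g
```

With $\varepsilon = 1/9$, $\gamma_1 = 0.89$ and $\pi = 0.1$, the threshold is a few thousand iterations, which illustrates why $n_0$ is said to be of order $1/\pi^2$.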
Remark 4.1. Note that the above result induces some constraints on $\varepsilon$ and $\gamma_1$. These constraints, which allow us to manage the constants of the inequality, are mainly adapted to the proof of Theorem 3.2 (b). In fact, in the proofs of Theorems 3.1 and 3.2 (a), we will need to rewrite the above property in a slightly different way (see Section 4.5 for details). 


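Proof. For any integer $r > 0$ and $n \ge 0$, the binomial formula applied to (22) leads to 

\[ (1 - X_{n+1})^r = (1 - X_n - \Delta X_{n+1})^r = (1 - X_n)^r - r (1 - X_n)^{r-1} \Delta X_{n+1} + \sum_{j=0}^{r-2} \binom{r}{j} (1 - X_n)^j (-\Delta X_{n+1})^{r-j}, \]

where $\sum_{\emptyset} = 0$ and $\Delta X_{n+1} = X_{n+1} - X_n = \gamma_{n+1} [h(X_n) + \rho_{n+1} \kappa_\sigma(X_n) + \Delta M_{n+1}]$. From the definition of $h$ given in (20), we get 

\[ (1 - x)^{r-1} \big[ h(x) + \rho_{n+1} \kappa_\sigma(x) \big] = \pi x (1 - x)^r + \rho_{n+1} \kappa_\sigma(x) (1 - x)^{r-1}. \]

If we define now 

\[ \beta_n^{(r)} = -r \rho_{n+1} (1 - X_n)^{r-1} \kappa_\sigma(X_n) + \frac{1}{\gamma_{n+1}} \sum_{j=0}^{r-2} \binom{r}{j} (1 - X_n)^j (-\Delta X_{n+1})^{r-j}, \tag{28} \]

we can then conclude using (25) that 

\[
\begin{aligned}
Z_{n+1}^{(r)} &= \frac{\gamma_n}{\gamma_{n+1}} Z_n^{(r)} - r \pi \gamma_n X_n Z_n^{(r)} + \beta_n^{(r)} - r (1 - X_n)^{r-1} \Delta M_{n+1} \\
&= Z_n^{(r)} \Big( 1 + \gamma_n \Big[ \frac{1}{\gamma_{n+1}} - \frac{1}{\gamma_n} - r \pi X_n \Big] \Big) + \beta_n^{(r)} - r (1 - X_n)^{r-1} \Delta M_{n+1} \\
&= Z_n^{(r)} \big( 1 + \gamma_n [\varepsilon_n - r \pi X_n] \big) + \beta_n^{(r)} - r (1 - X_n)^{r-1} \Delta M_{n+1} \\
&= Z_n^{(r)} \big( 1 + \gamma_n [\varepsilon_n - r \pi] \big) + r \pi \gamma_n (1 - X_n) Z_n^{(r)} + \beta_n^{(r)} - r (1 - X_n)^{r-1} \Delta M_{n+1} \\
&= Z_n^{(r)} \big( 1 + \gamma_n [\varepsilon_n - r \pi] \big) + r \pi \gamma_n Z_n^{(r+1)} + \beta_n^{(r)} - r (1 - X_n)^{r-1} \Delta M_{n+1}.
\end{aligned}
\tag{29}
\]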


The formulation above is important: it exhibits a contraction of $(1 + \gamma_n [\varepsilon_n - r\pi])$ on $Z_n^{(r)}$ that can be used jointly with an upper bound of $Z_n^{(r+1)}$ and a simple majorization of $\beta_n^{(r)}$. In this view, we study (28): $|\Delta X_{n+1}| \le \gamma_{n+1}$ a.s. and (21) yields $|\kappa_\sigma(x)| \le (1 - \sigma p_2)$. Now, with $h_r$ given in (26), we get 

\[ \beta_n^{(r)} \le r \rho_1 \gamma_{n+1} + \sum_{j=0}^{r-2} \binom{r}{j} (\gamma_{n+1})^{r-j-1} \le r \big( \rho_1 + h_r(\gamma_{n+1}) \big) \gamma_{n+1}. \]

For any $\varepsilon \in (0,1)$, we can see in (29) that the contraction coefficient can be useful as soon as $n$ is large enough. More precisely, using (23), we see that 

\[ \varepsilon_n \le \varepsilon \pi \quad \Longleftarrow \quad n \ge n_0(\varepsilon, \pi, \gamma_1) := \left\lfloor \frac{1}{4 \varepsilon^2 \gamma_1^2 \pi^2} \right\rfloor + 1. \]

Then, for every $n \ge n_0(\varepsilon, \pi, \gamma_1)$, 

\[ 1 + \gamma_n [\varepsilon_n - r \pi] \le 1 - \alpha_r \gamma_n \quad \text{with} \quad \alpha_r = \pi (r - \varepsilon). \]

In the sequel, we will omit the dependence of $n_0$ on $(\varepsilon, \pi, \gamma_1)$ and will just use the notation $n_0$. Also remark that under the condition $2 \varepsilon \gamma_1^2 (r - \varepsilon) < 1$, we have $\alpha_r \gamma_j < 1$ for every $\pi \in (0,1)$ and for every $j \ge n_0$ (one can in particular check that $2 \varepsilon \gamma_1^2 (r - \varepsilon) < 1$ is true for every $\varepsilon \in (0, 1/3)$ and $\gamma_1 \in (0,1)$ if $r = 1, 2$). Thus, by a simple recursion based on (29), one obtains for every $n \ge n_0 + 1$, 


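\[ \mathbb{E}(Z_n^{(r)}) \le \mathbb{E}(Z_{n_0}^{(r)}) \prod_{j=n_0}^{n-1} (1 - \alpha_r \gamma_j) + \sum_{j=n_0}^{n-1} \Big( r \pi \gamma_j \mathbb{E}(Z_j^{(r+1)}) + \mathbb{E}(\beta_j^{(r)}) \Big) \prod_{l=j+1}^{n-1} (1 - \alpha_r \gamma_l). \]

If we call $\theta_r = r \big( \pi \sup_{j \ge n_0} \mathbb{E}(Z_j^{(r+1)}) + \rho_1 + h_r(\gamma_{n_0}) \big)$, an iteration of the previous inequality yields: 

\[ \mathbb{E}(Z_n^{(r)}) \le \mathbb{E}(Z_{n_0}^{(r)}) + \theta_r \sum_{j=n_0}^{n-1} \gamma_j \prod_{l=j+1}^{n-1} (1 - \alpha_r \gamma_l). \]

We aim to apply Lemma A.1 (deferred to the appendix section) to the last term. It is possible as soon as 

\[ n_0 \ge \frac{1}{(\alpha_r \gamma_1)^2}. \]

This last condition is fulfilled for any $r \ge 1$ when $4 \varepsilon^2 \gamma_1^2 \pi^2 \le (1 - \varepsilon)^2 \gamma_1^2 \pi^2$, i.e. when $\varepsilon \le 1/3$. Then, by Lemma A.1 one deduces that $\forall \varepsilon \le 1/3$ and $\forall n \ge n_0$: 

\[ \sup_{n \ge n_0} \mathbb{E} Z_n^{(r)} \le \mathbb{E} Z_{n_0}^{(r)} + \frac{r}{\pi (r - \varepsilon)} \Big[ \rho_1 + h_r(\gamma_{n_0}) + \pi \sup_{n \ge n_0} \mathbb{E} Z_n^{(r+1)} \Big]. \]

□ 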

On the basis of the last proposition and a recursive argument, we can now deduce the following key observations. 

Corollary 4.1. Assume that $\varepsilon \in (0, 1/3)$, $\gamma_1 \in (0,1)$ and that $n_0$ is defined in (27). Then, 

\[ \sup_{n \ge n_0} \mathbb{E}[Y_n] \le \mathbb{E}[Z_{n_0}^{(1)}] + \frac{\mathbb{E}[Z_{n_0}^{(2)}]}{1 - \varepsilon} + \frac{1}{\pi (1 - \varepsilon)} \Big[ \rho_1 + \frac{\rho_1}{1 - \varepsilon/2} + \frac{1}{2 (1 - \varepsilon/2)} \Big] + \frac{\sup_{n \ge n_0} \mathbb{E} Z_n^{(3)}}{(1 - \varepsilon)(1 - \varepsilon/2)}. \tag{30} \]

Remark 4.2. As in Proposition 4.1, this property is mainly written in view of Theorem 3.2 (b) where we only need to use the increase of exponent for $r = 1, 2$. For Theorems 3.1 and 3.2 (a) with $\sigma \in (0,1)$, we will need to use it for large values of $r$. 


4.4 Bound for $(\mathbb{E}(Z_n^{(3)}))_{n \ge n_0}$ 

As seen in Corollary 4.1, our next task is to bound $\mathbb{E}(Z_n^{(3)})$ for $n \ge n_0$ to obtain a tractable application of Equation (30). Such a bound is reached through careful inspection of the increments 

\[ \Delta Z_{n+1}^{(3)} := Z_{n+1}^{(3)} - Z_n^{(3)}. \]

Lemma 4.1 (Decomposition of $Z_n^{(3)}$). For every $n \ge 1$, 

\[ \mathbb{E}[\Delta Z_{n+1}^{(3)} \mid \mathcal{F}_n] = \gamma_{n+1} (1 - X_n) P_n(X_n) + \Delta R_n, \]

where for every $n \in \mathbb{N}$, $P_n$ is a polynomial function defined by 

\[ P_n(x) = \frac{(1 - x)^2}{\gamma_{n+1}} (\varepsilon_n - 3 \pi x) - 3 \rho_1 (1 - x) \kappa_\sigma(x) + 3 \big( x (1 - x)^2 p_1 + x^2 (1 - x) p_2 \big) + \gamma_{n+1} \big( -x (1 - x)^2 p_1 + x^3 p_2 \big), \tag{31} \]

and if $\gamma_1$ and $n_0$ satisfy the assumptions of Proposition 4.1, then 

\[ \forall n \ge n_0, \quad \Delta R_n \le (1 - \sigma p_2) \big[ 3 \gamma_{n+1} \rho_{n+1}^2 + \gamma_{n+1}^2 \rho_{n+1}^3 \big]. \]
Remark 4.3. • The key point is that $\gamma_k = \gamma_1 k^{-1/2}$ and, therefore, the series $\sum_{k \ge n} \Delta R_k$ is uniformly bounded, regardless of the value of $n$. This will be enough to obtain a competitive upper bound of the regret. With the choice of $n_0$ given in (27), careful inspection of Lemma 4.1 leads to: 

\[ \sum_{k \ge n_0} \Delta R_k \le 12 \gamma_1^4 \rho_1^2 \varepsilon \pi + \frac{16}{3} \gamma_1^8 \rho_1^3 \varepsilon^3 \pi^3. \tag{32} \]

• As in Remark 4.2, it should be noted that for Theorems 3.1 and 3.2 (a) with $\sigma \in (0,1)$, we will need to use such a development with some larger values of $r$ (see the end of this section for details). 


Proof. We again use Equation (29) and deduce that: 

\[ Z_{n+1}^{(3)} - Z_n^{(3)} = (1 - X_n)^3 (\varepsilon_n - 3 \pi X_n) - 3 \rho_1 \gamma_{n+1} (1 - X_n)^2 \kappa_\sigma(X_n) \tag{33} \]

\[ \qquad + \frac{1}{\gamma_{n+1}} \sum_{j=0}^{1} \binom{3}{j} (1 - X_n)^j (-\Delta X_{n+1})^{3-j} - 3 (1 - X_n)^2 \Delta M_{n+1}. \tag{34} \]

First, note that terms in Equation (33) are associated with the first two terms in the definition of $P_n$ introduced in (31) up to a multiplication by $(1 - X_n) \gamma_{n+1}$. 


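Second, we can easily compute the expectations involved in the sum of Equation (34) since the events are all disjoint. On the one hand, when $j = 1$ we have 

\[
\begin{aligned}
\frac{1}{\gamma_{n+1}} \mathbb{E}[(-\Delta X_{n+1})^2 \mid \mathcal{F}_n] ={}& \gamma_{n+1} \sigma \big( p_1 X_n (1 - X_n)^2 + p_2 (1 - X_n) X_n^2 \big) \\
&+ \gamma_{n+1} (1 - \sigma) \big( p_1 X_n (1 - X_n - \rho_{n+1} X_n)^2 + p_2 (1 - X_n) (X_n - \rho_{n+1} (1 - X_n))^2 \big) \\
&+ \gamma_{n+1} \rho_{n+1}^2 \big[ X_n^3 (1 - p_1) + (1 - X_n)^3 (1 - p_2) \big].
\end{aligned}
\]

Further computations yield: 

\[ \frac{1}{\gamma_{n+1}} \mathbb{E}[(-\Delta X_{n+1})^2 \mid \mathcal{F}_n] = \gamma_{n+1} X_n (1 - X_n) \big( p_1 (1 - X_n) + p_2 X_n \big) + \Delta A_n^{(1)} + \underbrace{\gamma_{n+1} \rho_{n+1}^2 \big[ X_n^3 (1 - \sigma p_1) + (1 - X_n)^3 (1 - \sigma p_2) \big]}_{:= \Delta R_n^{(1)}} \]

with $\Delta A_n^{(1)} = -2 \rho_{n+1} \gamma_{n+1} X_n (1 - X_n)(1 - \sigma)(X_n p_1 + (1 - X_n) p_2)$. On the other hand, we can also compute the term when $j = 0$: 

\[ \frac{1}{\gamma_{n+1}} \mathbb{E}[(-\Delta X_{n+1})^3 \mid \mathcal{F}_n] = \gamma_{n+1}^2 X_n (1 - X_n) \big( p_2 X_n^2 - p_1 (1 - X_n)^2 \big) + \Delta A_n^{(2)} + \underbrace{\gamma_{n+1}^2 \rho_{n+1}^3 \big[ X_n^4 (1 - \sigma p_1) - (1 - X_n)^4 (1 - \sigma p_2) \big]}_{:= \Delta R_n^{(2)}} \]

with $\Delta A_n^{(2)} \le 3 \gamma_{n+1}^2 \rho_{n+1} (1 - \sigma) X_n (1 - X_n)^2 (\pi X_n + \rho_{n+1} (1 - X_n) p_2)$. Set $\Delta R_n^{(3)} = (1 - X_n) \Delta A_n^{(1)} + \Delta A_n^{(2)}$ and $\Delta R_n := 3 (1 - X_n) \Delta R_n^{(1)} + \Delta R_n^{(2)}$. Plugging the previous controls into (34) yields 

\[ \mathbb{E}[\Delta Z_{n+1}^{(3)} \mid \mathcal{F}_n] \le \gamma_{n+1} (1 - X_n) P_n(X_n) + \Delta R_n. \tag{35} \]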

Note that $\Delta R_n^{(1)}$ can be upper bounded as follows: 

\[ 3 (1 - X_n) \Delta R_n^{(1)} \le 3 \gamma_{n+1} \rho_{n+1}^2 (1 - \sigma p_2) \max_{0 \le t \le 1} \Big[ \frac{1 - \sigma p_1}{1 - \sigma p_2} t^3 (1 - t) + (1 - t)^4 \Big]. \]

Since $1 - \sigma p_1 \le 1 - \sigma p_2$, a study of the function shows that $a t^3 (1 - t) + (1 - t)^4$, when $a \in (0,1)$, reaches its maximal value for $t = 0$. This leads to: 

\[ 3 (1 - X_n) \Delta R_n^{(1)} \le 3 \gamma_{n+1} \rho_{n+1}^2 (1 - \sigma p_2). \]

For $\Delta R_n^{(2)}$, we have $\Delta R_n^{(2)} \le \gamma_{n+1}^2 \rho_{n+1}^3 (1 - \sigma p_2) \max_{0 \le t \le 1} \big[ \frac{1 - \sigma p_1}{1 - \sigma p_2} t^4 - (1 - t)^4 \big]$, which involves an increasing function of $t$. Thus, we have 

\[ \Delta R_n^{(2)} \le \gamma_{n+1}^2 \rho_{n+1}^3 (1 - \sigma p_1) \le \gamma_{n+1}^2 \rho_{n+1}^3 (1 - \sigma p_2). \]

Finally, if $\gamma_1$ and $n_0$ satisfy the assumptions of Proposition 4.1, then for every $n \ge n_0$, $\gamma_{n+1} \le 2/3$ and it follows that $\Delta R_n^{(3)} \le 0$. The result follows according to Equation (35). □ 
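The moment computations in the proof above can be double-checked by exact enumeration. As a hedged illustration, the sketch below reuses an assumed one-step update (six disjoint outcomes, reconstructed here from the displayed formula for the second moment; it is not quoted from the definition of the algorithm) and compares the enumerated moments with the closed forms. The function names are hypothetical.

```python
from fractions import Fraction as F

def outcomes(x, g, rho, p1, p2, s):
    """(probability, increment of X) under the ASSUMED one-step update."""
    return [
        (x * p1 * s,              g * (1 - x)),
        (x * p1 * (1 - s),        g * (1 - x - rho * x)),
        (x * (1 - p1),           -g * rho * x),
        ((1 - x) * p2 * s,       -g * x),
        ((1 - x) * p2 * (1 - s), -g * (x - rho * (1 - x))),
        ((1 - x) * (1 - p2),      g * rho * (1 - x)),
    ]

def scaled_moment(k, x, g, rho, p1, p2, s):
    """E[(-dX)^k | F_n] / gamma, computed by exact enumeration."""
    return sum(pr * (-dx) ** k for pr, dx in outcomes(x, g, rho, p1, p2, s)) / g

def second_moment_formula(x, g, rho, p1, p2, s):
    """Main term + Delta A^(1) + Delta R^(1), as in the proof of Lemma 4.1."""
    main = g * x * (1 - x) * (p1 * (1 - x) + p2 * x)
    dA1 = -2 * rho * g * x * (1 - x) * (1 - s) * (x * p1 + (1 - x) * p2)
    dR1 = g * rho ** 2 * (x ** 3 * (1 - s * p1) + (1 - x) ** 3 * (1 - s * p2))
    return main + dA1 + dR1

def third_moment_bound(x, g, rho, p1, p2, s):
    """Main term + upper bound on Delta A^(2) + Delta R^(2)."""
    main = g ** 2 * x * (1 - x) * (p2 * x ** 2 - p1 * (1 - x) ** 2)
    dA2 = 3 * g ** 2 * rho * (1 - s) * x * (1 - x) ** 2 * ((p1 - p2) * x + rho * (1 - x) * p2)
    dR2 = g ** 2 * rho ** 3 * (x ** 4 * (1 - s * p1) - (1 - x) ** 4 * (1 - s * p2))
    return main + dA2 + dR2
```

The second moment is an identity, while the third moment is only bounded from above because the non-positive term $-p_1 \rho_{n+1} X_n^2$ is dropped from $\Delta A_n^{(2)}$.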

In order to bound $\sup_{n \ge n_0} \mathbb{E}(Z_n^{(3)})$, we now have to precisely study the polynomial function $P_n$ and exhibit a mean-reverting effect on its dynamics. 


Proposition 4.2. Let e £ (0, 4), p\ < ||| and 


3\/2(l ~<r)pi 


<7? < 


2(1+Pl) 


Then 


i ) 77ie polynomial P n given by (31) is negative on [0,1 — 


2 ( 1 +Pl) 


Tn+l\ 


ii) Zn ' satisfies 


sup EZ^ 3) < E Zff + U'yfpjeTT + y 7 iP?e 3 Tr 


n>no 


+ 


87 4 e(l + pi) [l + (1 + Pi)[2 + 6pi + 127 jf]] 


Remark 4.4. The above result is given under some technical conditions that will lead to a sharp explicit bound. Nevertheless, the reader has to keep in mind that in view of the condition on $\sigma$, the “universal” bound on $(\mathbb{E}(Z_n^{(3)}))_{n \ge n_0}$ is only accessible when $\sigma < 1$, i.e. in the over-penalized case. When $\sigma = 1$, some bounds will be attainable only if $p_2$ is not too large (see (24) for a similar statement when $r = 1$), and in order to alleviate the constraint on $p_2$, it will be necessary to take a larger exponent than $r = 3$ (see Subsection 4.5 for details). 


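Proof. We first provide the proof of i). The function $P_n$ introduced in (31) is a third-degree polynomial and for $n \ge n_0$: 

\[
\begin{aligned}
P_n(0) &= \frac{\varepsilon_n}{\gamma_{n+1}} - 3 \rho_1 \kappa_\sigma(0) \\
&\le \frac{\gamma_n}{2 \gamma_1^2 \gamma_{n+1}} - 3 \rho_1 (1 - \sigma p_2) \\
&\le \frac{\sqrt{1 + n_0^{-1}}}{2 \gamma_1^2} - 3 \rho_1 (1 - \sigma p_2).
\end{aligned}
\]

Since $p_2 \le 1$, this last quantity is negative if: 

\[ \rho_1 \gamma_1^2 > \frac{1}{3 \sqrt{2} (1 - \sigma)}. \tag{36} \]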


In the same way, we can check that $P_n(1) = \gamma_{n+1} p_2 > 0$ and, therefore, $P_n$ has one root in the interval $(0,1)$. Careful inspection of the leading coefficient (designated $a_n x^3$) of $P_n$ in (31) shows that: 

\[ a_n = \Big[ 3 (1 + \sigma \rho_1) - \frac{3}{\gamma_{n+1}} - \gamma_{n+1} \Big] \pi. \]

The leading coefficient $a_n$ is negative as soon as $3 (1 + \sigma \rho_1) \le \frac{3}{\gamma_{n+1}}$. Again, the choice of $n_0$ in (27) shows that this last condition is fulfilled as soon as 

\[ \frac{1}{\varepsilon} \ge 2 \gamma_1^2 \pi (1 + \sigma \rho_1). \tag{37} \]

It should however be noted that we have assumed $\varepsilon \in (0, 1/3]$ so that $\frac{1}{\varepsilon} \ge 3$. As a consequence, (36) and (37) are satisfied as soon as $(\gamma_1, \rho_1)$ satisfies 

\[ \frac{1}{3 \sqrt{2} (1 - \sigma) \rho_1} \le \gamma_1^2 \le \frac{3}{2 (1 + \rho_1)}. \]

Hence, if (36) and (37) hold, $P_n$ possesses one root in $(-\infty, 0)$ and another one in $(1, +\infty)$. Consequently, $P_n$ has a unique root in $(0,1)$. We now consider: 

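\[ \xi_n = \frac{2 (1 + \rho_1)}{\pi} \gamma_{n+1} =: \xi \gamma_{n+1}. \]

We compute that: 

\[
\begin{aligned}
P_n(1 - \xi_n) ={}& \frac{\xi_n^2}{\gamma_{n+1}} \big( \varepsilon_n - 3 \pi (1 - \xi_n) \big) - 3 \rho_1 \xi_n \big[ (1 - \sigma p_2) \xi_n^2 - (1 - \sigma p_1)(1 - \xi_n)^2 \big] \\
&+ 3 \big[ (1 - \xi_n) \xi_n^2 p_1 + (1 - \xi_n)^2 \xi_n p_2 \big] + \gamma_{n+1} \big[ (1 - \xi_n)^3 p_2 - \xi_n^2 (1 - \xi_n) p_1 \big].
\end{aligned}
\]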

Hence, replacing by ^n+i and simplifying by 7 n +i, we see that P n (l — £ n ) is negative when 

~A„( £) 


ee n 
(1 - U 


+ 3pi(l - (7p \) (1 - £ n )£ + 3pi7„+l£ 2 + 3p 2 (l - £n)£ +P2(1 - £n) 2 

/o_^2 , 3pi7 2 + l? 3 (l-CTP2) , 2 ^2 

< 3^ H- -r—j. -b 7n+if ■ 

s n 




From (23), we know that e n < and 1 — £ n < 1 thus 

- ^7i 


An(£) £ £ 7n+l 


27l 2 (l-?n) 


+ 3pi I + 3£ (pi + 1) + 1 


In the meantime, we will use the simple lower bound B n (£) > 37 t£ 2 . We can check that 1 — = 


1 - 


2(l+pi)7„+i 


> 1 — 4e(l + pi) 7 2 since 7„ 0 < 2 e 7 2 7 r. Thus 


A r 


2(1 + pi) 


/ 4(1 + Pi) 2 

< --o-7n+l 


3pi 


< 


(1 + Pi) 2 


24e7 2 pi 


2 7 l 2 [l-4e(l + Pi)7i 2 ]J 
4e 

+ 7 


1 -4e(l + p!)7 2 


6(1 + pi) 2 


+ 1 


and 


B, , 


2(1 > 12(1 + pi ) 2 


As a consequence, P„( 1 — £ n ) is negative if we have 


5 > 24e 7 2 + 


4e 

1 -4e(l + pi)7^ 


From the constraint on 71 , another computation shows that the above condition is fulfilled when 
e 2 i28(Ufpi) _ e jg4 _|_ 40^ _(_ p 1 ) ] + 45 > 0 . We then observe that all values of e in ( 0 , |] can be 
conveniently used when pi < |||, 
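The sign pattern established in part i) can be observed numerically for one parameter set. The sketch below evaluates $P_n$ from (31) with $\gamma_n = \gamma_1/\sqrt{n}$; the numerical values ($\gamma_1 = 0.89$, $\rho_1 = 0.38$, $\sigma = 0$, $p_1 = 0.7$, $p_2 = 0.6$, $n = 2557$) are an illustrative choice made here, and a grid check is of course not a proof. Function names are hypothetical.

```python
import math

def P_n(x, n, gamma1, rho1, p1, p2, sigma):
    """Polynomial of Equation (31), evaluated at x in [0, 1]."""
    pi = p1 - p2
    g_next = gamma1 / math.sqrt(n + 1)                       # gamma_{n+1}
    eps = (math.sqrt(n + 1) - math.sqrt(n)) / gamma1         # epsilon_n
    kappa = (1 - sigma * p2) * (1 - x) ** 2 - (1 - sigma * p1) * x ** 2
    return ((1 - x) ** 2 / g_next * (eps - 3 * pi * x)
            - 3 * rho1 * (1 - x) * kappa
            + 3 * (x * (1 - x) ** 2 * p1 + x ** 2 * (1 - x) * p2)
            + g_next * (-x * (1 - x) ** 2 * p1 + x ** 3 * p2))

def leading_coefficient(n, gamma1, rho1, p1, p2, sigma):
    """Coefficient a_n of x^3 in P_n."""
    g_next = gamma1 / math.sqrt(n + 1)
    return (p1 - p2) * (3 * (1 + sigma * rho1) - 3 / g_next - g_next)

def negative_on_interval(n, gamma1, rho1, p1, p2, sigma, grid=1000):
    """Grid check that P_n < 0 on [0, 1 - xi_n], xi_n = 2 (1 + rho1) gamma_{n+1} / pi."""
    xi_n = 2 * (1 + rho1) / (p1 - p2) * gamma1 / math.sqrt(n + 1)
    upper = 1 - xi_n
    return all(P_n(upper * i / grid, n, gamma1, rho1, p1, p2, sigma) < 0
               for i in range(grid + 1))
```

With these values, $P_n(0) < 0 < P_n(1)$ and $a_n < 0$, so the cubic has exactly one root in $(0,1)$, and that root lies to the right of $1 - \xi_n$.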
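To obtain ii), the main idea is to use the sharp estimation of the sign of $P_n$ on $[0,1]$ and to obtain an upper bound of $\mathbb{E} Z_n^{(3)}$. Note that: 

\[
\begin{aligned}
\sup_{0 \le t \le 1} \gamma_{n+1} (1 - t) P_n(t) &= \gamma_{n+1} \sup_{1 - \xi_n \le t \le 1} (1 - t) P_n(t) \\
&= \gamma_{n+1} \sup_{1 - \xi_n \le t \le 1} \Big\{ \frac{(1 - t)^3}{\gamma_{n+1}} [\varepsilon_n - 3 \pi t] - 3 \rho_1 (1 - t)^2 \kappa_\sigma(t) \\
&\qquad + 3 \big[ t (1 - t)^3 p_1 + t^2 (1 - t)^2 p_2 \big] + \gamma_{n+1} \big[ -t (1 - t)^3 p_1 + t^3 (1 - t) p_2 \big] \Big\}.
\end{aligned}
\]

We have seen in the proof of i) that $t \in [1 - \xi_n, 1] \Longrightarrow \varepsilon_n \le 3 \pi t$. Hence, using $\kappa_\sigma(t) \ge -(1 - \sigma p_1) t^2$, we have 

\[
\begin{aligned}
\sup_{0 \le t \le 1} \gamma_{n+1} (1 - t) P_n(t) &\le \gamma_{n+1} \big[ 3 \rho_1 (1 - \sigma p_1) \xi_n^2 + 3 p_1 \xi_n^3 + p_2 \xi_n^2 + \gamma_{n+1} \xi_n \big] \\
&\le \frac{C_1(\rho_1, p_1, p_2, \sigma)}{\pi^2} \gamma_{n+1}^3 + \frac{C_2(\rho_1, p_1)}{\pi^3} \gamma_{n+1}^4
\end{aligned}
\]

with $C_1(\rho_1, p_1, p_2, \sigma) = (1 + \rho_1) \big( 12 \rho_1 (1 + \rho_1)(1 - \sigma p_1) + 4 p_2 (1 + \rho_1) + 2 \pi \big)$ and $C_2(\rho_1, p_1) = 24 p_1 (1 + \rho_1)^3$, shortened to $C_1$ and $C_2$ below. We apply Lemma 4.1 to upper bound $\sup_{n \ge n_0} \mathbb{E} Z_n^{(3)}$: 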

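\[
\begin{aligned}
\sup_{n \ge n_0} \mathbb{E} Z_n^{(3)} &\le \mathbb{E} Z_{n_0}^{(3)} + \sup_{n \ge n_0} \mathbb{E} \sum_{k=n_0}^{n-1} \Delta Z_{k+1}^{(3)} \\
&\le \mathbb{E} Z_{n_0}^{(3)} + \sup_{n \ge n_0} \mathbb{E} \Big[ \sum_{k=n_0}^{n-1} \gamma_{k+1} (1 - X_k) P_k(X_k) + \Delta R_k \Big] \\
&\le \mathbb{E} Z_{n_0}^{(3)} + \frac{C_1}{\pi^2} \sum_{k=n_0}^{\infty} \gamma_{k+1}^3 + \frac{C_2}{\pi^3} \sum_{k=n_0}^{\infty} \gamma_{k+1}^4 + \sum_{k=n_0}^{\infty} \mathbb{E} \Delta R_k.
\end{aligned}
\]

Using a simple comparison argument with the integrals $\int t^{-\alpha} \, dt$, we obtain: 

\[ \sum_{k=n_0}^{\infty} \gamma_{k+1}^3 \le 2 \gamma_1^3 n_0^{-1/2} \le 4 \gamma_1^4 \varepsilon \pi \quad \text{and} \quad \sum_{k=n_0}^{\infty} \gamma_{k+1}^4 \le \gamma_1^4 n_0^{-1} \le 4 \gamma_1^6 \varepsilon^2 \pi^2. \]

We then deduce that: 

\[ \sup_{n \ge n_0} \mathbb{E} Z_n^{(3)} \le \mathbb{E} Z_{n_0}^{(3)} + \frac{4 \gamma_1^4 \varepsilon C_1}{\pi} + \frac{4 \gamma_1^6 \varepsilon^2 C_2}{\pi} + \sum_{k=n_0}^{\infty} \Delta R_k. \]

The result now follows using (32). 

□ 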
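The comparison with integrals used for the two series can be checked on partial sums. The sketch below assumes $\gamma_k = \gamma_1/\sqrt{k}$ and the integer-part reading of $n_0$ in (27); the truncation length and the parameter values in the usage note are arbitrary illustrative choices, and the function names are hypothetical.

```python
import math

def n0(eps, pi, gamma1):
    """Threshold of Equation (27); the floor is an assumption of this sketch."""
    return math.floor(1 / (4 * eps ** 2 * gamma1 ** 2 * pi ** 2)) + 1

def partial_tail_sum(power, start, gamma1, terms=100_000):
    """Sum of gamma_{k+1}^power for k = start, ..., start + terms - 1."""
    return sum((gamma1 / math.sqrt(k + 1)) ** power
               for k in range(start, start + terms))

def integral_bounds(eps, pi, gamma1):
    """Pairs (integral bound, bound after inserting n0) for the powers 3 and 4."""
    start = n0(eps, pi, gamma1)
    cubic = (2 * gamma1 ** 3 / math.sqrt(start), 4 * gamma1 ** 4 * eps * pi)
    quartic = (gamma1 ** 4 / start, 4 * gamma1 ** 6 * eps ** 2 * pi ** 2)
    return cubic, quartic
```

Every partial sum stays below the first bound of each pair, which in turn stays below the second one; this is the chain of inequalities displayed above.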


Explicit bound. We can now conclude the proof of Theorem 3.2. 

Proof of Theorem 3.2 (b). We consider the extreme over-penalized case obtained with $\sigma = 0$ and use a power increment until $r = 3$. Recall that $n_0 := n_0(\varepsilon, \pi, \gamma_1)$ is defined by (27). In particular, $\sqrt{n_0 - 1} \le (2 \varepsilon \gamma_1 \pi)^{-1}$ and for $i = 1, 2, 3$, $\pi \mathbb{E}[Z_{n_0}^{(i)}] \le (2 \varepsilon \gamma_1^2)^{-1} + (\gamma_1)^{-1}$. Taking the results of Proposition 4.2 ii) and Corollary 4.1 and plugging them into (19), a series of computations yields: 

\[ \sup_{p_1 > p_2} \frac{R_n}{\sqrt{n}} \le c(\gamma_1, \rho_1, \varepsilon) := T_1(\gamma_1, \rho_1, \varepsilon) + \frac{2 \gamma_1}{(1 - \varepsilon)(1 - \varepsilon/2)} T_2(\gamma_1, \rho_1, \varepsilon), \tag{38} \]
(38) 
where 


and 


Ti(7i,pi,e) 


1 

2e7i 




+ 2pi 


1 


1 -e 


1 

(I^Kl 



(l-e)(l-e/2)7 

+_H_ 

+ (l-e)(l-e/2)’ 


12 ( 71 , Pi, e) 


= 7i 


16 


8e(l + pi) (l + (1 + pi)(2 + 6pi + 127 ^)) + 12p 2 e + —■7iPi£ 3 


Theorem 3.2 (b) follows by minimizing $(\gamma_1, \rho_1, \varepsilon) \mapsto c(\gamma_1, \rho_1, \varepsilon)$ under the constraints: 

e£1A s 

The “best” upper bound was obtained by setting $\gamma_1 = 0.89$, $\rho_1 = 0.38$, $\varepsilon = 1/9$, leading to the regret upper bound 

\[ R_n \le 44 \sqrt{n}. \]

□ 


4.5 Proof of Theorems 3.1 and 3.2 (a) 

We prove these results together. We thus consider $\gamma_1 \in (0,1)$, $\rho_1 \in (0,1)$ and $\sigma \in [0,1]$. A variant of Proposition 4.1 concerning the increase of exponent is still valid. First, it can be observed that if 


we set e r = r — 1/2 (so that a r = 7 r/ 2 ), then Lemma A.l can be applied with n > (f 71 ) 2 . Thus, 


we set no (A) := + 1 with A > 1 . After a simple adaptation of the proof of Proposition 

|4.1[ it can be deduced that for every r > 1, 


27- 


sup EZ^<EZ^ + - 


n>no(A) 


Pl + h r {ln 0 (X)) + TT Slip Z^ +1) 
n>no(A) 


By an iteration, it follows by using the fact that 7 rE [Z^,^] < 7r7 no ( A ) <77 X (A + 1) that for every 
r > 1, some constants C/(A) and C 2 ( A) exist (depending only on er, 71 and pi) such that, 

sup 7 riE[y„] <C r 1 (A) + C' 2 (A) 7 r sup E z£ +1 \ (39) 

n>no(A) n>no(A) 


It remains to upper bound $\sup_{n \ge n_0(A)} \mathbb{E} Z_n^{(r)}$ for $r$ large enough. Once again, a simple adaptation of the proof of Lemma 4.1 for $r \ge 3$ yields: 

\[ \mathbb{E}[\Delta Z_{n+1}^{(r)} \mid \mathcal{F}_n] = \gamma_{n+1} (1 - X_n)^{r-2} P_n^{(r)}(X_n) + \Delta R_n^{(r)}, \]

with 

\[
\begin{aligned}
P_n^{(r)}(x) ={}& \frac{(1 - x)^2}{\gamma_{n+1}} (\varepsilon_n - r \pi x) - r \rho_1 (1 - x) \kappa_\sigma(x) + \binom{r}{r-2} \big( x (1 - x)^2 p_1 + x^2 (1 - x) p_2 \big) \\
&+ \gamma_{n+1} \binom{r}{r-3} \big( -x (1 - x)^2 p_1 + x^3 p_2 \big)
\end{aligned}
\tag{40}
\]

and $\Delta R_n^{(r)} \le C_r \gamma_{n+1}^3$ (where $C_r$ does not depend on $\pi$). We want to prove that $P_n^{(r)}$ is negative on $[0, 1 - \xi_n]$ with $\xi_n = \frac{\xi}{\pi} \gamma_{n+1} \in (0,1)$, where $\xi$ is a constant to be calibrated. We follow the lines of the proof of Proposition 4.2, but we can use some rougher arguments since we are not looking for explicit constants. First, $P_n^{(r)}(0) = \frac{\varepsilon_n}{\gamma_{n+1}} - r \rho_1 \kappa_\sigma(0)$, so that: 

\[ P_n^{(r)}(0) < 0 \quad \Longleftarrow \quad \gamma_1^2 \rho_1 > \frac{1}{r \sqrt{2} (1 - \sigma p_2)}. \]

On the one hand, for every $\sigma < 1$, it is possible to find an $r$ sufficiently large for which this condition holds. On the other hand, when $\sigma = 1$ (case of Theorem 3.1), we then need to assume that a $\delta > 0$ exists such that $p_2 \le 1 - \delta$ (in this case, the condition is satisfied if $r > (\gamma_1^2 \rho_1 \sqrt{2} \delta)^{-1}$). For such an $r$, it can be observed that the leading coefficient $a_n^{(r)}$ (related to $x^3$) is: 


a 


M = 


It can therefore be deduced that 


(r) 

a„ ; is negative for every n > n° where: 
< : = 71 +cr/51 ) ' 


Assume that A > in order to obtain no(A) > nf. Since Pn\ 1) = 7n+i ( r L 3 )p 2 > 0 and 

deg(Pi r ^) = 3, it follows that P/' 1 has exactly one root in (0,1) for every n > n 0 and that Pn^ is 
negative on [0,1 — £„] as soon as Pn\ 1 — £, n ) < 0. Let n be such that ^7 n +i < 1/2. Then, some 
rough estimations yield that Pn\ 1 — £ n ) is negative if 

1 > 0 , 

where c r is a constant that does not depend on w. We then check that another constant i] r exists 
such that the previous property is fulfilled if £ > ijr/n. Then, Pn\ 1 — ^ 7 „ + i) < 0 is negative as 
soon as ^ 7 „+i <1/2. This is true for every n > n 0 (A) as soon as A > 2 7 ip r . We can conclude 
from what preceeds that an r > 3 and A > 0 exist such that for every n > no (A), for every 
(pi, p 2 ) £ [ 0 , l] 2 , such that pi > p 2 (resp. p± > p 2 and p 2 < 1 — 5) if a < 1 (resp. if a = 1 ) 

E[A< 7 n+i sup (1 - f)P( r >(f) + C rl 3 n+X . 

te[l-^ 7 n+l,l] 


Using 7 n +i < 7r/A if n > no (A), a constant C\ exists such that on 

V£ € [1 — — 7 n+l, 1] Pn r \t) < C , A 7 n+l/ 71 -- 

7r 

Under the previous conditions, we deduce 


sup ( 7r sup E) < sup ( 7r Cy^ +1 (it 2 + n x ) 1 < +oo. 
w \ n>n 0 (X) ) * y „>„ 0 J 


The result follows by plugging this inequality into (391. 


5 Almost sure and weak limit of the over-penalized bandit 

We provide here the proofs of Propositions 3.1 and 3.2. For the sake of simplicity, we restrict our study to $\sigma = 1$ (always over-penalization of the bandit), and the argument can be adapted for any values of $\sigma \in (0,1]$. 

5.1 A.s. convergence of the multi-armed bandit (Proposition 3.1) 

Recall first that X n = {X n i, ...,X nd ), the multi-armed penalized bandit © makes it possible to 
define for i G {2, d}, 


X n +1 ,i — X n ^i + '7'n-t-l^'i (-Vi) T ^n+lPn+l^i^Xn) -f- r y n +\X]\I n J r \ d , 

where the main part of the drift hi is defined as 


and the penalty drift is 


hi(x l, ...,£<*) = (1 - Xi)XiPi ~ Xj^XjPj, 


i(x 1 ,... ,x d ) = -z-(l - ft) + “ A')- 




Hence, the martingale increment is simply obtained as 


XM — ((1 Xn t i)^-V n ^.i i i,A n ^-i i i X n j ^ ^ 1 




Pn+1 (^n,ilv„ + i -i ,A= +1 ; ,^—1 ^ 1 ^w,jly n +i.j,A; + i +Kj(A n )) 




Proof of Proposition 3.1. We start with (i) and identify the stationary points of the ODE method. The ODE $\dot{x} = h(x)$ possesses a finite number of equilibria that can be easily identified. We begin by solving the equation $h_1(x) = 0$. Since 

\[ h_1(x) = x_1 \sum_{j=2}^{d} x_j (p_1 - p_j) \ge 0, \]

we either have $x_1 = 1$ and $x_2 = \ldots = x_d = 0$ or $x_1 = 0$. 

Then, the equation $h_2(x) = 0$ with $x_1 = 0$ may be reduced to 

\[ x_2 \sum_{j=3}^{d} x_j (p_2 - p_j) \ge 0. \]

The same argument leads to $x_2 = 1$ or $x_2 = 0$ and a straightforward recursion shows that the equilibria of the ODE are $(\delta^i)_{1 \le i \le d}$, with $(\delta^i)_{1 \le i \le d}$ defined as 

\[ \delta^i_i = 1 \quad \text{and} \quad \delta^i_j = 0 \quad \forall j \ne i. \]

Let us emphasize that to discriminate among these equilibria, it is not possible to use the second 

derivative criterion that relies on ( ——- ) to establish their stability. Instead, it is possible to 

\ 9x iJij 

check that c ) 1 fulfills the Lyapunov certificate with the function V(x) = (x 2 + ... + x 2 d ). If we 
denote h = (hi ,..., hd), we then have: 


d 

(VV(x),h(x)) = ^2x 2 ^2x k {pj -p k ). 

j='2 k^j 

Considering a; in a closed neighborhood of 5 1 defined as Xj < e/d, \/j > 2 (implying that x\ > 1 — e), 
we see that: 

d d 

(XV(x),h(x)) = x 1 'Y x ‘j(Pj-Pi) + Y x k x o(Pk-Pj ) 

j —2 k =2 jjtk,j>l 

d d 

< -(1 -e)(p 1 -p 2 )Y x2 j+ € J2 x P 

3 =2 3 =2 
and the term above is negative as soon as $\varepsilon$ is chosen such that: 

1 

e < - 

Pi - Vi + 1 

In contrast, the other equilibria (<P),j ^ 1 are unstable: this can be easily deduced from the 
unstability of the two-armed bandit by testing the first arm vs. the arm j. 

Since the martingale increment AM n+ \ t i is uniformly bounded, we can apply the Kushner-Clark 
theorem (see HU) and can conclude that (X. n ^) n >o either converges to 1 or 0 a.s. As a consequence, 
it is also true that ( X n ) n >o converges a.s. We now make this limit explicit and show that (X n ) n >o 
converges toward (1,..., 0) a.s. We start by noticing that hi(x) = ii -> 2 x j(P ~ 1 ~ Pj) — 0, 
which implies that: 


n —1 n—1 

X n ,i > *0,1 + y^7jPj w j-i,iP0-i) + T^AMj. (41) 

3 =1 3 =1 

The martingale increment A Mj is bounded and a large enough C exists such that A Mj < yfC. 
This implies that: 


n—1 

E A; AM ? 

3 =1 


L 2 


n—1 / \ n—1 

<C^ 7 |<Csup(i)^ Wi . 


3 =1 


3=1 


Since ^ pj jj = +oo, we can deduce that 


E 


lim 

n—»-+oo 


3 =1 


n—1 

E ijPj 

3=1 


= 0 so that lim sup 


n—1 

E TjAMj 

_ 

n—1 

E Tj'Pj 

3=1 


> 0. 


We now consider an event w £ {-Xoo,i = 0}. We have: 


lim (X„(cj)) = -—- V(1 -pfc)-Aoo,fe(w) 2 , 
—>+oo a — 1 


fc >2 


and according to the Toeplitz Lemma we deduce that 


lim 

n—>oo 


n—1 


J=1 


n—1 

E 7jPj 

3=1 


■j—[ E(! ~ Pk)X 00 ,k{^f‘ > 0. 

fc> 2 


Putting together this last remark with Equation (41) leads to the conclusion 


lim sup > 0. 

n—t-oc 

E HPi 

3=1 

We obtain a contradiction with the boundedness of (. X n ) n >\ and conclude that = 0) = 0. 

For (ii), we refer to m since the arguments here are similar. □ 
5.2 Weak convergence of the normalized bandit (Proposition 3.2) 

The proof of the weak convergence follows the lines of |16| . The idea is to prove the tightness 
of the pseudo-trajectories associated to the normalized sequence and to then show that any weak 
limit of this sequence is a solution of the martingale problem (£,C^- (R+,(R+) d x ) where £ is the 
infinitesimal generator defined in Proposition |3.2| Then, proving that uniqueness holds for the 
solutions of the martingale problem and for the invariant distribution, the convergence follows. 
Here, we choose to only detail the key step of the characterization of the limit. The rest of the 
proof can be obtained by a simple generalization of that of di- 

proposition 5.1. Let f be a continuously differentiable function with compact support in M+ _1 . 
We have 


E(/(y n+1)2 ,..., y n+M ) - f(Y n , 2 ,..., y„, d ) \F n ) = jn+i£ d f(Y n , 2 ,..., y n , d ) + 0P (i), 


where £d is the PDMP generator defined in (15) and T n = a(Y^,k < n ). 

Proof. Since the proof does not depend on a, we assume that a = 1 for the sake of clarity. We 
first give an alternative expression for the variables Y n y for i > 2. 

Y n -{-\ t i = Y n i T 'Yn+l f (Pi Pi)Y n ,i^ T 'yn+l£'n,i QAM n ^.\^, 

where C n>i = (nfiX n ) - + Y riti (p 1 - + (e n + ^f^(Pi - X n,jPj ))) = op(l) since (e„)n>o 

i¥=i 

converges 0 and (A4y)n>o converges to 0 in probability for i > 2. We rewrite this as follows 

'1 -Pi 


^n+1,2 — Y n i T y n +i 


d- 1 


— {.Pi — Pi)Y n ,i + C Ut i I -T G„,i + gAM n+ i ti , 


where G n ^ — <?( 1 X n,iPi) and AM n +\i AM n j r \ ^ G 

We consider a function / € C 1 (M^ _1 ) with a compact support. 


d 

f(Y n+ r) - f(Y n ) = f(Yn+1 ,2, -Yn+l,i, ....Yn+l,d) - f(Y n , 2 , ...Yn,i, ■■■■Y n ,„). 

i— 2 

We will use the following notation Fi(Yf) = f(Y n ^, ■ ■■Yk,i, ■■■■Y n < f)- This means that the first i — 1 
variables are ( Y n 2, Y„ t 3,...) and the last d — i ones are: (y„+i,j+i, 54+i,i+2) ..., 54 +m)- We have: 

Fi(Y n+ 1 ,i) - Fi(Y nii ) = Fi{Y n+ i.O - Fi(Y nii ) + - FfiY^), 


where 


and 


Y n ,i — 1 n,i T Tn+1 ( ^ (.Pi Pi^jYn i -T J , 

1 nA — Y n i T G n i. 


We begin by writing: 

Fi(Y n +i,i) — Fi(Y nt i) = diFi(Y nt i)AM n+ ij + 7 n +il4+i,i, 


where the first order Taylor approximation formula yields: 

30 € [0,1] : 14 +i,* = [Fi(Y nti + 9 AM n+ 1 ,f) - FfiY^) 

As a consequence, 14+i,i = op(l) and we are now going to prove that: 

Fi(Y n + 1 ) — FfiYn) — 7„+i AiFi(Y n ) 


AM, 




’ - lim E 


\X n = 0, 


7n+l 
where: 


Af(Y 2 ,...,Y d ) = P ^(f(Y 2 ,...,Y i +g,...,Y d )-f(Y 2 ,...,Y i ,...Y d )) 


(lj-^--piY?)d i f(Y 2 ,...,Y d ). 


We compute: 

E(F i (y n> i))|J r nji ) = PiX nfi Fi(Y nti +g( 1 - X n ^)(l - PiX nti )) 
+ (1 QPiX n ,i)Fi{Yn,i 

Let us decompose the r.h.s. of the above equation into two parts, denoted by: 

F n ,i = PiX n! i(Fi(Y nti + g( 1 - X n ,i)(l ~PiX nii )) - Fi(Y nti )), 

and 


(42) 

(43) 


G n ,i = (1 - gPiX n ,i)(Fi(Y„ } i - gpiX nti (l - X nd ))) - Fi{Y n ,i))- 

Note that (42) is the jump part of the PDMP and (43) is the deterministic one. If i > 2, {X n< i) n > i 
converges to 0 in probability and p n "fn+ i _1 = g + o(p n ). Thus: 


7ti+1 F n ' i — 7ti+1 PnPi^ n,i(Fi(Y n ,i T g T Op(l)) F^{Y n j / ^ 

PiY n 


g 


(1 + o(p n )) + 3 + op(l)) - Fi(y n ,i)]. 


As a consequence, the asymptotic behavior of (|42|) is given by 

F, 


’ — lim 


rwoo \7„ + i 


n,i v F'i(Y n i+g) — Fi(Y n i) 

-Piin,i - 1=0. 


3 


We now study (43) and compute: 


Y n . i gX n i( 1 Pi.X rl j ) — Ln,i + 7n+l ^ ^ ^ Pl^n/i 

+ 'Yn+lPiYn,i gP%X nd (1 An,) T 7n+l G nd 
'1 “Pi 


— Ln,, T y, 


71+1 


d- 1 


PlAn,, 


T 7ii+l PiYn.i gpnPiYn,i {,1 A n ,?) + 7n+l^+i,i 


:=7»+lC«,i 

where gp n = 7„. Since C n ,7 converges to 0 in probability, we obtain: 

7n+lGn,i 

_ -1 /i , „/ 


1 ~Pl 

d — 1 

Op(i). 


Pl^ 71,7 


Fi(Y n i 'Yn+l 

(^Fi(Y ni i OVi-fl 

-Pi v 

i-1 PlYn '\ 

'^t~PiY n ,i 

+ 771+1^,1,7) — Fi(Y nt i) 
+ 771+1^71,1) — Fi(Y nt i )^ 

7 n+l 

1_Pl m V ■ 
d-1 Pl J rt ,7 



We finally obtain the limiting behavior of (43): 


’ - lim ( FbL _ f 1 Pl - piYj ) diFi(Y n i ) ) = 0. 
\7n+l \ — 1 


This ends the proof of the proposition. 


□ 
6 Ergodicity of the PDMP 

From now on, the variable $(X_t)_{t \ge 0}$ will refer to a trajectory of the PDMP associated with the normalized (over)-penalized bandit and bears no relation to the multi-armed bandit sequence $(X_n)_{n \ge 1}$. 


6.1 Wasserstein results 


We begin the study of the ergodicity of the PDMP whose infinitesimal generator is (16) with some 
computations of the moments of the process. 


Lemma 6.1. Let $(X_t)_{t \ge 0}$ be a Markov process whose generator $\mathcal{L}$ is defined by (16). If $\pi := b - cg > 0$, then $\sup_{t \ge 0} \mathbb{E}[(X_t^x)^p] \le C(1 + |x|^p)$. In particular, the invariant distribution $\mu_\infty$ has moments of any order and

$$\forall t \ge 0, \quad \mathbb{E}(X_t) = \frac{a}{\pi} + \Big( \mathbb{E}(X_0) - \frac{a}{\pi} \Big) e^{-\pi t}.$$

Proof. Let us define $f_p(x) = x^p$. We have:

$$\mathcal{L} f_p(x) = p(a - bx)x^{p-1} + cx\big( (x + g)^p - x^p \big) = -p\pi f_p(x) + pa f_{p-1}(x) + c \sum_{k=0}^{p-2} \binom{p}{k} g^{p-k} f_{k+1}(x), \tag{44}$$

where we adopt the convention $\sum_{\emptyset} = 0$. If we now define $\alpha_p(t) = \mathbb{E}(X_t^p)$, the previous relation shows that, for any integer $p \ge 1$, $\alpha_p$ satisfies the ODE

$$\alpha_p'(t) + p\pi \alpha_p(t) = pa\, \alpha_{p-1}(t) + c \sum_{k=0}^{p-2} \binom{p}{k} g^{p-k} \alpha_{k+1}(t).$$

For example, with $p = 1$ we have $\alpha_1'(t) = -\pi \alpha_1(t) + a$, which implies that

$$\alpha_1(t) = \frac{a}{\pi} + \Big( \mathbb{E}(X_0) - \frac{a}{\pi} \Big) e^{-\pi t}.$$

The control of the moments of order $p > 1$ then follows from a recursion. □
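As a numerical sanity check of the moment recursion, the following minimal Python sketch integrates the ODE system obtained from (44) with a classical fourth-order Runge-Kutta scheme and compares the first moment with its closed form. The function names, the choice of solver, the step count and the parameter values are illustrative assumptions and are not taken from the paper; the process is assumed to start from a deterministic point `x0`.

```python
import math


def moment_rhs(alpha, a, b, c, g):
    """Right-hand side of the moment ODEs; alpha[p] stands for E[X_t^p], alpha[0] = 1."""
    pi = b - c * g  # contraction rate pi = b - c*g
    rhs = [0.0] * len(alpha)
    for p in range(1, len(alpha)):
        # Jump contribution: c * sum_{k=0}^{p-2} binom(p, k) g^{p-k} alpha_{k+1}
        jump = sum(math.comb(p, k) * g ** (p - k) * alpha[k + 1] for k in range(p - 1))
        rhs[p] = -p * pi * alpha[p] + p * a * alpha[p - 1] + c * jump
    return rhs


def integrate_moments(x0, a, b, c, g, order, t_end, steps=20000):
    """RK4 integration of (alpha_0, ..., alpha_order) on [0, t_end], started at X_0 = x0."""
    alpha = [float(x0) ** p for p in range(order + 1)]
    h = t_end / steps

    def shift(u, k, s):
        return [ui + s * ki for ui, ki in zip(u, k)]

    for _ in range(steps):
        k1 = moment_rhs(alpha, a, b, c, g)
        k2 = moment_rhs(shift(alpha, k1, h / 2), a, b, c, g)
        k3 = moment_rhs(shift(alpha, k2, h / 2), a, b, c, g)
        k4 = moment_rhs(shift(alpha, k3, h), a, b, c, g)
        alpha = [u + h / 6 * (p1 + 2 * p2 + 2 * p3 + p4)
                 for u, p1, p2, p3, p4 in zip(alpha, k1, k2, k3, k4)]
    return alpha


def mean_closed_form(x0, a, b, c, g, t):
    """Closed form a/pi + (x0 - a/pi) exp(-pi t) of the first moment."""
    pi = b - c * g
    return a / pi + (x0 - a / pi) * math.exp(-pi * t)
```

For instance, `integrate_moments(3.0, 1.0, 2.0, 1.0, 0.5, order=2, t_end=2.0)[1]` can be compared with `mean_closed_form(3.0, 1.0, 2.0, 1.0, 0.5, 2.0)`; higher moments are obtained by increasing `order`, which mirrors the recursion on $p$ used in the proof.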


6.1.1 Rescaled two-armed bandit & Theorem


In the following, we will exploit Equation (44) to obtain a suitable upper bound of the Wasserstein
distance $W_p$ between the law of $X_t$ and the invariant measure $\mu_\infty$ of the PDMP. For this purpose,
we note that the generator $\mathcal{L}$ possesses the stochastic monotonicity property, i.e., a coupling
$(X, Y)$ exists starting from $(x, y)$ (with $x \ge y$) such that $X_t \ge Y_t$ for any $t \ge 0$. The increase of the
jump rate (with respect to the position) and the positivity of the jumps are of prime importance
for this property. Such a coupling can be built as follows: we only allow simultaneous jumps
of both components or a single jump of the highest one (see (0) for a similar procedure). The
generator of this coupling $(X, Y)$ starting from $(x, y)$ with $x \ge y$ is given by:


$$\mathcal{L}_W f(x, y) = (a - bx)\, \partial_x f(x, y) + (a - by)\, \partial_y f(x, y) + cy \big( f(x + g, y + g) - f(x, y) \big) + c(x - y) \big( f(x + g, y) - f(x, y) \big), \tag{45}$$

with a symmetric expression when $y > x$. We now prove the main result.
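The coupling described by (45) can be illustrated with a short simulation. The sketch below is only an illustration under stated assumptions: it uses a standard thinning (acceptance-rejection) scheme for the jump times, which is a choice made here and not a method taken from the paper, and the names `flow` and `simulate_coupling` are hypothetical. It assumes $a > 0$, $b > 0$ and a starting point with $x \ge y \ge 0$, and it records whether the order $X_t \ge Y_t$ is preserved at each proposed jump time.

```python
import math
import random


def flow(x, t, a, b):
    """Deterministic flow of dx/dt = a - b*x started at x, evaluated at time t."""
    return a / b + (x - a / b) * math.exp(-b * t)


def simulate_coupling(x, y, t_end, a, b, c, g, rng):
    """Simulate the coupled pair of (45) up to t_end, for x >= y >= 0.

    Returns (X_{t_end}, Y_{t_end}, order_preserved).
    """
    assert x >= y >= 0
    t = 0.0
    ok = True
    while True:
        # Between jumps the upper path stays below max(x, a/b),
        # so this dominates the total jump rate c * X_s.
        rate_bound = c * max(x, a / b)
        tau = rng.expovariate(rate_bound)
        if t + tau >= t_end:
            x, y = flow(x, t_end - t, a, b), flow(y, t_end - t, a, b)
            return x, y, ok and x >= y
        x, y, t = flow(x, tau, a, b), flow(y, tau, a, b), t + tau
        if rng.random() < c * x / rate_bound:  # accept the proposed time as a true jump
            if rng.random() < y / x:           # simultaneous jump, rate c * y
                x, y = x + g, y + g
            else:                              # jump of the higher path only, rate c * (x - y)
                x += g
        ok = ok and x >= y
```

A call such as `simulate_coupling(3.0, 1.0, 5.0, 1.0, 2.0, 1.0, 0.5, random.Random(0))` returns the two end points together with a flag indicating that the ordering was never violated along the simulated path.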


Proof of Theorem 3.3. Let $\mu_0$ be a probability on $\mathbb{R}_+$ and designate $\mu_\infty$ as the invariant distribution
of the PDMP. Set

$$\mathcal{C}_t = \{ \nu \in \mathcal{P}(\mathbb{R}_+^2),\ \nu(dx \times \mathbb{R}_+) = \mu_t(dx),\ \nu(\mathbb{R}_+ \times dy) = \mu_\infty(dy) \}.$$

For any $\nu \in \mathcal{C}_0$, let $(X_t, Y_t)_{t \ge 0}$ denote the Markov process driven by (45) starting from $\nu$. From the
definition of $W_p$ and the stationarity of $(Y_t)$, we have for any $t$:

$$W_p(\mu_t, \mu_\infty)^p \le \inf \Big\{ \int \mathbb{E}\big[ |X_t^x - Y_t^y|^p \big] \nu(dx, dy),\ \nu \in \mathcal{C}_0 \Big\}.$$

At the price of a potential exchange of the coordinates, we can now work with some deterministic
starting points $x$ and $y$ such that $x > y \ge 0$. Owing to the monotonicity of $\mathcal{L}_W$, we thus have for
any $p \ge 1$

$$\mathbb{E}\big( |X_t^x - Y_t^y|^p \big) = \mathbb{E}\big( (X_t^x - Y_t^y)^p \big).$$

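Assume now that $p \in \mathbb{N}^*$; we observe that $\mathcal{L}_W$ acts on $(x, y) \mapsto (x - y)^p$ as:

$$\mathcal{L}_W (x - y)^p = -p\pi (x - y)^p + pa (x - y)^{p-1} + c \sum_{k=0}^{p-2} \binom{p}{k} g^{p-k} (x - y)^{k+1}.$$

Setting $\beta_p(t) = \mathbb{E}|X_t^x - Y_t^y|^p$, we can immediately check that:

$$\beta_p'(t) + \pi p \beta_p(t) = pa\, \beta_{p-1}(t) + c \sum_{k=0}^{p-2} \binom{p}{k} g^{p-k} \beta_{k+1}(t). \tag{46}$$

When $p = 1$, (46) implies that $\beta_1(t) = \beta_1(0) e^{-\pi t}$, i.e. $\mathbb{E}[X_t^x - Y_t^y] = (x - y) e^{-\pi t}$, so that:

$$W_1(\mu_t, \mu_\infty) \le W_1(\mu_0, \mu_\infty) e^{-\pi t}.$$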

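For the lower-bound, we use:

$$W_1(\mu_t, \mu_\infty) \ge \inf \Big\{ \Big| \int (x - y)\, \nu_t(dx, dy) \Big|,\ \nu_t \in \mathcal{C}_t \Big\} = \big| \mathbb{E}[X_t] - \mathbb{E}[Y_t] \big|,$$

which implies that:

$$W_1(\mu_t, \mu_\infty) \ge \Big| \int \mathbb{E}[X_t^x - Y_t^y]\, \mu_0(dx) \mu_\infty(dy) \Big| = e^{-\pi t} \Big| \int (x - y)\, \mu_0(dx) \mu_\infty(dy) \Big|.$$

The lower-bound follows.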

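Now, let us consider the case $p > 1$ (with $p \in \mathbb{N}$). For $p = 2$, we have

$$\big( \beta_2(t) e^{2\pi t} \big)' e^{-2\pi t} = (2a + cg^2) \beta_1(0) e^{-\pi t},$$

and an integration leads to $\beta_2(t) e^{2\pi t} - \beta_2(0) = \frac{2a + cg^2}{\pi} \beta_1(0) [e^{\pi t} - 1]$. As a consequence:

$$\beta_2(t) \le e^{-2\pi t} \beta_2(0) + \frac{2a + cg^2}{\pi} \beta_1(0) e^{-\pi t}.$$

Using the inequalities $\sqrt{u + v} \le \sqrt{u} + \sqrt{v}$ and $\beta_2 \ge W_2^2$, we thus deduce that:

$$W_2(\mu_t, \mu_\infty) \le W_2(\mu_0, \mu_\infty) e^{-\pi t} + \sqrt{\frac{2a + cg^2}{\pi}} \sqrt{W_1(\mu_0, \mu_\infty)}\, e^{-\frac{\pi t}{2}}.$$

The result follows when $p = 2$ by setting:

$$\gamma_2 := W_2(\mu_0, \mu_\infty) + \sqrt{\frac{2a + cg^2}{\pi}} \sqrt{W_1(\mu_0, \mu_\infty)}.$$

A recursive argument based on (46) shows that a constant $\gamma_p$ exists that only depends on $\mu_0$ and
$\mu_\infty$ such that:

$$W_p(\mu_t, \mu_\infty) \le \gamma_p e^{-\frac{\pi t}{p}}.$$

□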






















6.2 Proof of total variation results 


As mentioned before, the idea is to wait until the paths get close (with a probability controlled 
by the Wasserstein bound) and then to try to stick them (with high probability). Since the jump 
size is deterministic, sticking the paths implies a non trivial coupling of the jump times which is 
described in the lemma below. 

We begin by establishing the next useful lemma. 


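Lemma 6.2. Let $\varepsilon > 0$ and $s > \frac{1}{b} \ln(1 + \varepsilon)$. A coupling $(X_t, Y_t)_{t \ge 0}$ of paths driven by (16) exists
such that on $A_{x_0, \varepsilon}$:

$$\mathbb{P}(X_t = Y_t,\ t \ge s) \ge \Big( 1 - \frac{c}{b} x_0 \varepsilon - e^{-\frac{a}{b} cs} - \frac{c\varepsilon}{b} \Big) \max\Big( 0, 1 - \frac{c}{b} \varepsilon (x_0 + g) \Big),$$

where $A_{x_0, \varepsilon} = \{ (x, y) \mid \frac{a}{b} \le x \le x_0,\ 0 < x - y \le \varepsilon \}$.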


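Proof. Let $\varepsilon > 0$ and $(x, y) \in A_{x_0, \varepsilon}$ (in particular, $x > y$). Designate $T_1^x$ and $T_1^y$ as the first
jumps of $(X_t^x)$ and $(X_t^y)$, respectively, and $T_2^x$ as the second jump of $(X_t^x)$. It can be noted that:

$$\mathbb{P}(X_t = Y_t,\ t \ge s) \ge \mathbb{P}\big( X^x_{T_1^y} = X^y_{T_1^y},\ T_1^y \le s \big).$$

We aim to build a coupling that leads to a sharp lower-bound of the r.h.s. For this purpose, note
that if $T_1^x < T_1^y < T_2^x$, the triple $(T_1^x, T_1^y, T_2^x)$ satisfies:

$$X^x_{T_1^y} = X^y_{T_1^y} \iff \Big( y - \frac{a}{b} \Big) e^{-b T_1^y} + g = \Big( x - \frac{a}{b} \Big) e^{-b T_1^y} + g\, e^{-b (T_1^y - T_1^x)}.$$

Considering that $X^x_{T_1^x} = \frac{a}{b} + \big( x - \frac{a}{b} \big) e^{-b T_1^x} + g$ and defining $\psi(t) = \frac{1}{b} \ln\big( e^{bt} + \frac{x - y}{g} \big)$, we can verify
that $X^x_{T_1^y} = X^y_{T_1^y}$, $T_1^y \le s$ and $T_1^x < T_1^y < T_2^x$ as soon as

$$T_1^y = \psi(T_1^x) \le s \quad \text{and} \quad T_2^x > \psi(T_1^x),$$

since $\psi(t) > t$. We are naturally encouraged to consider $S^{x,s} = \psi(T_1^x) \mathbf{1}_{\{\psi(T_1^x) \le s\}}$ and it is well
known that the law of $(T_1^x, T_1^y)$ can be described through the maximal coupling:

$$T_1^y = \Theta U + (1 - \Theta) V_y, \qquad \psi(T_1^x) = \Theta U + (1 - \Theta) V_x,$$

where $V_x$, $V_y$, $\Theta$ and $U$ are independent, $U \sim \dfrac{\mathbb{P}_{S^{x,s}} \wedge \mathbb{P}_{T_1^y}}{\| \mathbb{P}_{S^{x,s}} \wedge \mathbb{P}_{T_1^y} \|_{TV}}$ and $\Theta \sim \mathcal{B}(p)$ where $p = \| \mathbb{P}_{S^{x,s}} \wedge \mathbb{P}_{T_1^y} \|_{TV}$. With this coupling, if $q(t, z) = \mathbb{P}(T_1^z > \psi(t) - t)$, the strong Markov property yields

$$\mathbb{P}\big( T_2^x > T_1^y \mid (T_1^x, T_1^y) \big) = \mathbb{P}\big( T_2^x > \psi(T_1^x) \mid T_1^x \big) = q\big( T_1^x, X^x_{T_1^x} \big).$$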

Since $z \mapsto q(t, z)$ is non-increasing and $x \ge a/b$ (from the assumption on $A_{x_0, \varepsilon}$), we deduce that
$X^x_{T_1^x} \le x + g$ and it therefore follows that:

$$\mathbb{P}\big( T_2^x > T_1^y \mid (T_1^x, T_1^y) \big) \ge q(T_1^x, x + g) \ge q(0, x + g),$$

given that $t \mapsto \psi(t) - t$ is a non-increasing function. As a consequence, we obtain that with this
coupling:

$$\mathbb{P}\big( X^x_{T_1^y} = X^y_{T_1^y},\ T_1^y \le s \big) \ge q(0, x + g)\, \mathbb{P}(\Theta = 1) = q(0, x + g)\, \| \mathbb{P}_{S^{x,s}} \wedge \mathbb{P}_{T_1^y} \|_{TV}. \tag{47}$$


It remains to find a lower bound of the total variation distance involved in the r.h.s. of the above
inequality. Recall that

$$\| \mathbb{P}_{S^{x,s}} \wedge \mathbb{P}_{T_1^y} \|_{TV} = \int_0^{+\infty} f_y(t) \wedge g_{x,s}(t)\, dt,$$

where $f_y$ and $g_{x,s}$ denote the densities of $T_1^y$ and $S^{x,s}$, respectively. We therefore have:

$$\forall t \ge 0, \quad f_y(t) = c\phi(y, t)\, e^{-\int_0^t c\phi(y, u)\, du} \quad \text{with} \quad \phi(y, t) = \frac{a}{b} + \Big( y - \frac{a}{b} \Big) e^{-bt},$$

and a change of variable yields:

$$\forall t \ge 0, \quad g_{x,s}(t) = f_x(\psi^{-1}(t))\, (\psi^{-1})'(t)\, \mathbf{1}_{\{\psi(0) \le t \le s\}}.$$

On the one hand, since $(x, y) \in A_{x_0, \varepsilon}$, we can check that:

$$\forall t \ge 0, \quad \phi(x, t) - \varepsilon e^{-bt} \le \phi(y, t) \le \phi(x, t),$$

and we can then conclude that:

$$\forall t \ge 0, \quad f_y(t) \ge f_x(t) - c\varepsilon e^{-bt}.$$

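On the other hand, note that:

$$\forall t \ge \psi(0), \quad \psi^{-1}(t) = \frac{1}{b} \ln\Big( e^{bt} - \frac{x - y}{g} \Big) \le t \quad \text{and} \quad (\psi^{-1})'(t) = \frac{e^{bt}}{e^{bt} - \frac{x - y}{g}} \ge 1, \tag{48}$$

and we can deduce from (48) that $\forall t \in [\psi(0), s]$:

$$g_{x,s}(t) \ge c\phi(x, \psi^{-1}(t))\, e^{-\int_0^{\psi^{-1}(t)} c\phi(x, u)\, du} \ge c\phi(x, t)\, e^{-\int_0^t c\phi(x, u)\, du} = f_x(t).$$

Note that we used that $t \mapsto \phi(x, t)$ is decreasing since $x \ge a/b$. Thus,

$$\big( \mathbb{P}_{T_1^y} \wedge \mathbb{P}_{S^{x,s}} \big)(dt) \ge h(t)\, dt \quad \text{with} \quad h(t) = \big( f_x(t) - c\varepsilon e^{-bt} \big) \mathbf{1}_{\{\psi(0) \le t \le s\}}.$$

As a consequence,

$$\| \mathbb{P}_{S^{x,s}} \wedge \mathbb{P}_{T_1^y} \|_{TV} \ge e^{-\int_0^{\psi(0)} c\phi(x, u)\, du} - e^{-\int_0^s c\phi(x, u)\, du} - \frac{c\varepsilon}{b}.$$

Checking that $\psi(0) \le \varepsilon/b$ and that $\forall t \ge 0$, $a/b \le \phi(x, t) \le x \le x_0$, we deduce that

$$\| \mathbb{P}_{S^{x,s}} \wedge \mathbb{P}_{T_1^y} \|_{TV} \ge e^{-\frac{c x_0 \varepsilon}{b}} - e^{-\frac{a}{b} cs} - \frac{c\varepsilon}{b} \ge 1 - \frac{c}{b} x_0 \varepsilon - e^{-\frac{a}{b} cs} - \frac{c\varepsilon}{b},$$

where we used $e^{-u} \ge 1 - u$ for $u \ge 0$ in the second inequality. To conclude the proof, it remains to plug
this inequality into (47) and to observe that:

$$q(0, x + g) \ge q(0, x_0 + g) = e^{-\int_0^{\psi(0)} c\phi(x_0 + g, u)\, du} \ge 1 - c\psi(0)(x_0 + g) \ge 1 - \frac{c}{b} \varepsilon (x_0 + g).$$

□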
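The sticking mechanism at the heart of the proof of Lemma 6.2 can be checked numerically: if the upper path jumps at time $T_1^x$ and the lower path jumps at time $\psi(T_1^x)$, with no other jump in between, the two paths coincide right after the second of these jumps. The following Python sketch is a minimal illustration of this identity; the function names and the numerical values are hypothetical and only serve the example.

```python
import math


def flow(x, t, a, b):
    """Deterministic flow of dx/dt = a - b*x started at x, evaluated at time t."""
    return a / b + (x - a / b) * math.exp(-b * t)


def psi(t, x, y, b, g):
    """psi(t) = (1/b) * ln(exp(b t) + (x - y) / g), defined for x > y."""
    return math.log(math.exp(b * t) + (x - y) / g) / b


def positions_after_sticking(x, y, t_x, a, b, g):
    """Positions of both paths just after the lower path jumps at time psi(t_x).

    The upper path starts at x and jumps once, at time t_x; the lower path
    starts at y and jumps once, at time psi(t_x) > t_x.
    """
    t_y = psi(t_x, x, y, b, g)
    x_after_jump = flow(x, t_x, a, b) + g            # upper path right after its jump
    x_at_ty = flow(x_after_jump, t_y - t_x, a, b)    # upper path at time t_y
    y_at_ty = flow(y, t_y, a, b) + g                 # lower path right after its jump
    return x_at_ty, y_at_ty
```

For example, `positions_after_sticking(2.0, 1.7, 0.4, 1.0, 2.0, 0.5)` returns two values that agree up to floating-point rounding.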

We now provide the proof of the ergodicity w.r.t. the total variation distance. 

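Proof of Theorem 3.4. For any starting distribution $\mu_0$,

$$\| \mu_0 P_t - \mu_\infty \|_{TV} \le \int \| \delta_x P_t - \delta_y P_t \|_{TV}\, \mu_0(dy)\, \mu_\infty(dx). \tag{49}$$

The idea is to use the Wasserstein coupling during a time $t_1$ and to then try to stick the paths on
the interval $[t_1, t]$ using Lemma 6.2. Consider $A_{x_0, \varepsilon}$ defined in Lemma 6.2 and the alternative set
$A^*_{x_0, \varepsilon} = \{ (x, y),\ a/b \le y \le x_0,\ 0 < y - x \le \varepsilon \}$. Setting $B_{x_0, \varepsilon} = A_{x_0, \varepsilon} \cup A^*_{x_0, \varepsilon}$, we have:

$$1 - \| \delta_x P_t - \delta_y P_t \|_{TV} \ge \mathbb{P}\big( X_t^x = Y_t^y \mid (X_{t_1}^x, Y_{t_1}^y) \in B_{x_0, \varepsilon} \big)\, \mathbb{P}\big( (X_{t_1}^x, Y_{t_1}^y) \in B_{x_0, \varepsilon} \big). \tag{50}$$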
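Since the Wasserstein coupling preserves the order and since $x \ge a/b$ $\mu_\infty(dx)$-a.s., it can be noted
that $\mu_\infty(dx)$-a.s.,

$$(X_{t_1}^x, Y_{t_1}^y) \in B_{x_0, \varepsilon} \iff \begin{cases} X_{t_1}^x - Y_{t_1}^y \le \varepsilon \ \text{and} \ X_{t_1}^x \le x_0 & \text{if } x > y, \\ Y_{t_1}^y - X_{t_1}^x \le \varepsilon \ \text{and} \ Y_{t_1}^y \le x_0 & \text{if } x < y. \end{cases}$$

It follows that for every $p > 0$, $\mu_\infty(dx)$-almost surely:

$$\mathbb{P}\big( (X_{t_1}^x, Y_{t_1}^y) \in B^c_{x_0, \varepsilon} \big) \le \mathbb{P}\big( |X_{t_1}^x - Y_{t_1}^y| > \varepsilon \big) + \mathbb{P}(X_{t_1}^x > x_0) + \mathbb{P}(Y_{t_1}^y > x_0) \le \frac{1}{\varepsilon} \mathbb{E}\big[ |X_{t_1}^x - Y_{t_1}^y| \big] + \frac{1}{x_0^p} \big( \mathbb{E}[(X_{t_1}^x)^p] + \mathbb{E}[(Y_{t_1}^y)^p] \big).$$

On the basis of Theorem 3.3 and Lemma 6.1, a constant $C_p$ exists such that $C_p$ depends on $p$, $\mu_0$
and $\mu_\infty$ but not on $t_1$ and satisfies:

$$\int \mathbb{P}\big( (X_{t_1}^x, Y_{t_1}^y) \in B^c_{x_0, \varepsilon} \big)\, \mu_0(dy)\, \mu_\infty(dx) \le \frac{W_1(\mu_0, \mu_\infty)}{\varepsilon} e^{-\pi t_1} + \frac{C_p}{x_0^p}.$$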

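Finally, Lemma 6.2 leads to:

$$\mathbb{P}\big( X_t^x = Y_t^y \mid (X_{t_1}^x, Y_{t_1}^y) \in B_{x_0, \varepsilon} \big) \ge \Big( 1 - \frac{c}{b} x_0 \varepsilon - e^{-\frac{a}{b} c (t - t_1)} - \frac{c\varepsilon}{b} \Big) \Big( 0 \vee \Big( 1 - \frac{c}{b} \varepsilon (x_0 + g) \Big) \Big),$$

so that by plugging the previous inequalities into (50) and (49), it can be deduced that for every
$p \ge 1$, a constant $C_p$ exists such that for every $t \ge 0$, for every $x_0$ and $\varepsilon$ such that $x_0 \varepsilon \le b/2c$
(with $x_0 \ge 1$ and $\varepsilon \in (0, 1)$),

$$\| \mu_0 P_t - \mu_\infty \|_{TV} \le C_p \Big( x_0 \varepsilon + e^{-\frac{a}{b} c (t - t_1)} + \varepsilon + \frac{1}{\varepsilon} e^{-\pi t_1} + \frac{1}{x_0^p} \Big).$$

If we try to optimize the above bound, we set $t_1 = \delta t$, $x_0 = C_1 e^{\alpha t}$, $\varepsilon = C_2 e^{-\beta t}$ with $\delta \in (0, 1)$ and
$\beta > \alpha > 0$ and deduce that a constant $C_p$ exists such that:

$$\| \mu_0 P_t - \mu_\infty \|_{TV} \le C_p \exp\Big( -t \Big[ (\beta - \alpha) \wedge \frac{ac}{b} (1 - \delta) \wedge (\delta\pi - \beta) \wedge \alpha p \Big] \Big).$$

We can choose $p$ as large as we want ($\mu_0$ has moments of any order) and thus $\alpha$ arbitrarily small.
The result then follows using an optimization on $(\beta, \delta)$. □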
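The final optimization on $(\beta, \delta)$ can be explored numerically. The sketch below is only an illustration under stated assumptions: it evaluates the exponent $(\beta - \alpha) \wedge \frac{ac}{b}(1 - \delta) \wedge (\delta\pi - \beta) \wedge \alpha p$ on a regular grid and keeps the best pair. The grid search, the function names and the parameter values are choices made here for illustration; they are not the optimization carried out in the paper.

```python
def tv_exponent(alpha, beta, delta, p, a, b, c, g):
    """Exponential rate min(beta - alpha, (a c / b)(1 - delta), delta*pi - beta, alpha*p)."""
    pi = b - c * g
    return min(beta - alpha, (a * c / b) * (1.0 - delta), delta * pi - beta, alpha * p)


def best_exponent(alpha, p, a, b, c, g, grid=400):
    """Grid search over delta in (0, 1) and beta in (alpha, pi).

    Returns (rate, beta, delta) for the best grid point.
    """
    pi = b - c * g
    best = (float("-inf"), None, None)
    for i in range(1, grid):
        delta = i / grid
        for j in range(1, grid):
            beta = alpha + (pi - alpha) * j / grid
            rate = tv_exponent(alpha, beta, delta, p, a, b, c, g)
            if rate > best[0]:
                best = (rate, beta, delta)
    return best
```

With a small `alpha` and a large `p` (so that `alpha * p` is not the binding term), `best_exponent(1e-3, 10**6, 1.0, 2.0, 1.0, 0.5)` returns an approximate best rate together with the corresponding `beta` and `delta`; a finer `grid` gives a better approximation at a higher cost.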


A Technical result for the pseudo-regret upper bound 

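Lemma A.1. Let $a > 0$, $\gamma_1 \in (0, 1)$ and $n_0 \in \mathbb{N}$ such that $a\gamma_{n_0} < 1$ and $n_0 \ge 1/(a\gamma_1)^2$, with $\gamma_l = \gamma_1/\sqrt{l}$. We have:

$$\forall n \ge n_0, \quad \sum_{j=n_0}^{n-1} \gamma_j \prod_{l=j}^{n-1} (1 - a\gamma_l) \le \frac{1}{a}.$$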

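Proof. Let $j \ge n_0$. On the basis of the inequality $\ln(1 + x) \le x$ for $x > -1$, we have

$$\prod_{l=j}^{n-1} (1 - a\gamma_l) = \exp\Big( \sum_{l=j}^{n-1} \ln(1 - a\gamma_l) \Big) \le \exp\Big( -a \sum_{l=j}^{n-1} \gamma_l \Big).$$

Using that $x \mapsto 1/\sqrt{x}$ is decreasing,

$$\sum_{l=j}^{n-1} \gamma_l = \gamma_1 \sum_{l=j}^{n-1} \frac{1}{\sqrt{l}} \ge \gamma_1 \int_j^n \frac{dx}{\sqrt{x}} = 2\gamma_1 (\sqrt{n} - \sqrt{j}),$$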
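so that:

$$\sum_{j=n_0}^{n-1} \gamma_j \prod_{l=j}^{n-1} (1 - a\gamma_l) \le \gamma_1 e^{-2a\gamma_1 \sqrt{n}} \sum_{j=n_0}^{n-1} \frac{e^{2a\gamma_1 \sqrt{j}}}{\sqrt{j}}.$$

Checking that $x \mapsto e^{2a\gamma_1 \sqrt{x}}/\sqrt{x}$ is non-decreasing on $\big[ \frac{1}{4a^2\gamma_1^2}, \infty \big)$, it can be deduced that for
any $n \ge n_0$,

$$\sum_{j=n_0}^{n-1} \frac{e^{2a\gamma_1 \sqrt{j}}}{\sqrt{j}} \le \int_{n_0}^{n} \frac{e^{2a\gamma_1 \sqrt{x}}}{\sqrt{x}}\, dx \le \frac{1}{a\gamma_1} e^{2a\gamma_1 \sqrt{n}}.$$

The lemma follows.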


□ 
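The bound of Lemma A.1 is easy to probe numerically. The following Python sketch computes the weighted sum for the step sequence $\gamma_l = \gamma_1/\sqrt{l}$ used in the proof, accumulating the product backwards so that each term costs constant time. The function name and the parameter values are illustrative assumptions.

```python
import math


def ns_weighted_sum(a, gamma1, n0, n):
    """Compute sum_{j=n0}^{n-1} gamma_j * prod_{l=j}^{n-1} (1 - a*gamma_l), gamma_l = gamma1/sqrt(l)."""
    total = 0.0
    prod = 1.0
    # Walk j downwards so that prod equals prod_{l=j}^{n-1} (1 - a*gamma_l) at each step.
    for j in range(n - 1, n0 - 1, -1):
        gamma_j = gamma1 / math.sqrt(j)
        prod *= 1.0 - a * gamma_j
        total += gamma_j * prod
    return total
```

With `a = 1.0`, `gamma1 = 0.5` and `n0 = 4` (so that $a\gamma_{n_0} = 0.25 < 1$ and $n_0 \ge 1/(a\gamma_1)^2 = 4$), the value of `ns_weighted_sum(1.0, 0.5, 4, n)` can be compared with the bound $1/a$ for increasing values of `n`.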